Posted to user@spark.apache.org by ssbiox <se...@gmail.com> on 2015/07/11 04:48:59 UTC

Linear search between particular log4j log lines

Hello,

I have a very specific question about how to search between particular
lines of a log file. I did some research, and what I learned is that once a
shuffle operation has been applied to an RDD, there is no way to
"reconstruct" the original sequence of lines (except by zipping with an id).
I'm looking for any useful approaches/workarounds other developers use to
solve this problem.

Here is a sample:
I have log4j log files where, for each request/transaction, specific BEGIN
and END transaction markers are printed. Somewhere in between, other classes
may report useful statistics that need to be parsed, and unfortunately
there is no way to attach the transaction id to those records. What is the
best approach to link a transaction with a particular line between the BEGIN
and END markers?

Assume only the timestamp and thread name are available:
2015-01-01 20:00:00 DEBUG className [Thread-0] - BEGIN TransactionID=AA000000001
2015-01-01 20:00:00 DEBUG className [Thread-0] - ... {some other logs}
2015-01-01 20:00:01 DEBUG className [Thread-0] - SQL execution time: 500ms
2015-01-01 20:00:02 DEBUG className [Thread-0] - ... {some other logs}
2015-01-01 20:00:05 DEBUG className [Thread-0] - END

In the end, I want to get a result that links transaction ID AA000000001
with the SQL execution time of 500ms.
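
For reference, here is a hypothetical parser for lines in this format (a
rough Scala sketch; the regex is just my guess at the layout):

// Capture the timestamp, the thread name, and the free-form message
// that follows " - ".
val logLine =
  """^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+ \S+ \[([^\]]+)\] - (.*)$""".r

def parseLine(s: String): Option[(String, String, String)] = s match {
  case logLine(ts, thread, msg) => Some((ts, thread, msg)) // (timestamp, thread, message)
  case _                        => None                    // continuation or garbage line
}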

Another good example would probably be extracting a Java stacktrace from
the logs, where the stacktrace lines have no key strings (timestamp, thread
id) at all to parse by.
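
The only workaround I can think of for that case is gluing continuation
lines onto the preceding timestamped line, per partition. A rough sketch
(lines is assumed to be an RDD[String]; a record that spans a partition
boundary still ends up broken, which would need a custom InputFormat or
reading whole files):

// Merge lines that don't start with a timestamp into the previous line.
// This only works within a single partition.
val tsPrefix = """^\d{4}-\d{2}-\d{2} """.r

val merged = lines.mapPartitions { it =>
  val buf = scala.collection.mutable.ArrayBuffer.empty[String]
  it.foreach { line =>
    if (tsPrefix.findFirstIn(line).isDefined || buf.isEmpty)
      buf += line                                  // a new record
    else
      buf(buf.length - 1) = buf.last + "\n" + line // a continuation line
  }
  buf.iterator
}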

So far I've come up with one "idea" and one approach:
1) Find out the file and the position of the BEGIN line, and run a separate
non-Spark process to parse that file line by line. In this case the question
is: what is the best way to know which file a given line belongs to, and at
what position? Is zipWithUniqueId helpful for that? I'm not sure it's really
efficient, or whether it can help find the file name (or maybe the Hadoop
partition). The closest thing I've found is the first sketch after this
list.

2) Use the thread name as the key, and map that key to the BEGIN/END lines.
Then create another RDD with the same key, but for the SQL-execution-time
lines. Then left-join the two RDDs by thread name and filter on the
timestamps coming from both sides, keeping only the SQL line that precedes
the END line (i.e. the SQL line's timestamp is before the END line's
timestamp).
An approach like this becomes very confusing when more information (more
lines) has to be extracted between BEGIN and END; a possible generalization
is the second sketch below. Are there any recommendations on how to handle
cases like that?
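
Here is the first sketch, for 1): sc.wholeTextFiles keeps the file name as
the key, so the line position can be recovered by splitting and numbering.
(Untested, the path is made up, and it's only practical while individual
log files fit in memory, since each file arrives as a single record.)

// Recover (fileName, lineNumber, line) for every BEGIN line, so a
// separate non-Spark process can seek back into that file.
val beginPositions = sc.wholeTextFiles("hdfs:///logs/*.log")
  .flatMap { case (file, content) =>
    content.split("\n").zipWithIndex.collect {
      case (line, idx) if line.contains("BEGIN TransactionID=") =>
        (file, idx, line)
    }
  }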
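
And the second sketch, a possible generalization of 2): instead of one join
per extracted line, group all lines by thread name and walk each thread's
timeline once. (It reuses the hypothetical parseLine above, relies on
"yyyy-MM-dd HH:mm:ss" timestamps sorting correctly as strings, and assumes
a whole thread's lines fit in memory; second-resolution timestamps can also
tie, so ordering within one second isn't guaranteed.)

// One pass per thread: remember the current transaction id and tag
// every line seen between BEGIN and END with it.
val txnLines = lines
  .flatMap(parseLine)                              // Option[(ts, thread, msg)]
  .map { case (ts, thread, msg) => (thread, (ts, msg)) }
  .groupByKey()
  .flatMap { case (_, tsAndMsgs) =>
    var txnId: Option[String] = None
    tsAndMsgs.toSeq.sortBy(_._1).flatMap { case (_, msg) =>
      if (msg.startsWith("BEGIN TransactionID=")) {
        txnId = Some(msg.stripPrefix("BEGIN TransactionID="))
        None
      } else if (msg.startsWith("END")) {
        txnId = None
        None
      } else {
        txnId.map(id => (id, msg)) // e.g. (AA000000001, "SQL execution time: 500ms")
      }
    }
  }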

Thank you,
Sergey



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Linear-search-between-particular-log4j-log-lines-tp23773.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Linear search between particular log4j log lines

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Can you not use sc.wholeTextFiles() and a custom parser or a regex to
extract the TransactionIDs?
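
Something like this (untested; it assumes transactions from different
threads don't interleave within a file, otherwise you'd have to split the
content per thread first):

// (?s) lets '.' match newlines, so each match spans one BEGIN..END
// block; the stats are then pulled out of the captured block.
val txnBlock = """(?s)BEGIN TransactionID=(\S+)(.*?)END""".r
val sqlTime  = """SQL execution time: (\d+ms)""".r

val results = sc.wholeTextFiles("hdfs:///logs/*.log")
  .flatMap { case (_, content) =>
    txnBlock.findAllMatchIn(content).flatMap { m =>
      sqlTime.findFirstMatchIn(m.group(2))
        .map(sql => (m.group(1), sql.group(1))) // (transactionId, "500ms")
    }
  }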

Thanks
Best Regards
