Posted to common-dev@hadoop.apache.org by Stephen Watt <sw...@us.ibm.com> on 2010/07/12 20:28:21 UTC

Hadoop Compression - Current Status

Please let me know if any of these assertions are incorrect. I'm going to be 
adding any feedback to the Hadoop Wiki. It seems well documented that the 
LZO Codec is the most performant codec (
http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html) 
but it is GPL infected and thus it is separately maintained here - 
http://github.com/kevinweil/hadoop-lzo. 

With regards to performance, if you are not using sequence files, Gzip is 
the next best codec to use, followed by bzip2. Hadoop has been able to 
process bzip2 and gzip input formats for a while now, but it could never 
split the files, i.e. it assigned one mapper per file. There are now two 
new features:
- Splitting bzip2 files available in 0.21.0 - 
https://issues.apache.org/jira/browse/HADOOP-4012
- Splitting gzip files (in progress but patch available) - 
https://issues.apache.org/jira/browse/MAPREDUCE-491
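
For reference, turning on one of the built-in codecs for job output is just 
a couple of configuration calls. A minimal sketch against the 0.20/0.21 
"mapreduce" API (the helper class name here is just a placeholder):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OutputCompression {
      /** Compress the final job output with gzip. */
      public static void enableGzipOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // Swap in org.apache.hadoop.io.compress.BZip2Codec above if splittable
        // output matters more than compression speed.
      }
    }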

1) It appears most folks are using LZO. Given that it is GPL, are you not 
worried about it virally infecting your project?
2) Is anyone using the new bzip2 or gzip file split compatible readers? 
How do you like them? General feedback?

Kind regards
Steve Watt

Re: Hadoop Compression - Current Status

Posted by Owen O'Malley <om...@apache.org>.
On Jul 12, 2010, at 11:28 AM, Stephen Watt wrote:

> 1) It appears most folks are using LZO. Given that it is GPL, are  
> you not
> worried about it virally infecting your project ?

The lzo bindings are not part of Hadoop and therefore can't infect  
Hadoop. They are a separate project (hadoop-gpl-compression) that  
depends on lzo and Hadoop. Hadoop-gpl-compression is distributed under  
the GPL, which satisfies lzo's licensing. That said, there is an  
effort underway to make bindings for one of the fast ASL-friendly  
codecs.
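
In the meantime, wiring the separate bindings into a job's configuration 
looks roughly like this (a sketch; the codec class names are the ones the 
hadoop-lzo project ships, and the native lzo library plus the hadoop-lzo 
jar still have to be installed on every node):

    import org.apache.hadoop.conf.Configuration;

    public class LzoCodecSetup {
      /** Register the externally installed LZO codecs alongside the built-in ones. */
      public static void registerLzo(Configuration conf) {
        conf.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.DefaultCodec,"
          + "org.apache.hadoop.io.compress.GzipCodec,"
          + "org.apache.hadoop.io.compress.BZip2Codec,"
          + "com.hadoop.compression.lzo.LzoCodec,"
          + "com.hadoop.compression.lzo.LzopCodec");
        conf.set("io.compression.codec.lzo.class",
            "com.hadoop.compression.lzo.LzoCodec");
      }
    }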

-- Owen

IOException: Owner 'mapred' for path XY not match expected owner 'AB'

Posted by Mathias Walter <ma...@gmx.net>.
Hi Guys,

I recently upgraded to the latest Cloudera Hadoop distribution. It contains hadoop-core-0.20.2+737.jar. If I now run my map job, I
get the following exception for a few tasks:

java.io.IOException: Owner 'mapred' for path
/hadoop/hdfs5/tmp/taskTracker/mathias.walter/jobcache/job_201010210928_0005/attempt_201010210928_0005_m_000000_0/output/spill437.out
.index did not match expected owner 'mathias.walter'
	at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:182)
	at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:108)
	at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:62)
	at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:55)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1480)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1172)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:574)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:641)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
	at org.apache.hadoop.mapred.Child.main(Child.java:211)

A total of 8 tasks run in parallel. They finished after about 8 hours, but some of them (19) crashed with the above
exception.

Why did so many tasks crash while others did not?

--
Kind regards,
Mathias


RE: IOException: Owner 'mapred' for path XY not match expected owner 'AB'

Posted by "Segel, Mike" <ms...@navteq.com>.
Yeah...

You need to go through each node and check to make sure all of your ownerships and permission levels are set correctly. 
It's a pain in the ass, but look on the bright side. You only have to do it once. :-)

-Mike


-----Original Message-----
From: patrickangeles@gmail.com [mailto:patrickangeles@gmail.com] On Behalf Of Patrick Angeles
Sent: Tuesday, October 26, 2010 8:04 AM
To: common-dev@hadoop.apache.org
Subject: Re: IOException: Owner 'mapred' for path XY not match expected owner 'AB'

Hi Matthias,

Best I can guess, you have uneven permissions on some of your
mapred.local.dir, causing tasks that run using those directories to fail.
See if these are all owned by user:group 'mapred:hadoop', and have
drwxr-xr-x permissions.

Regards,

- Patrick
On Tue, Oct 26, 2010 at 3:34 AM, Mathias Walter <ma...@gmx.net>wrote:

> Hi Guys,
>
> recently I upgraded to the recent Cloudera Hadoop distribution. It contains
> hadoop-core-0.20.2+737.jar. If I now run my map job, I
> get the following exception for a few tasks:
>
> java.io.IOException: Owner 'mapred' for path
>
> /hadoop/hdfs5/tmp/taskTracker/mathias.walter/jobcache/job_201010210928_0005/attempt_201010210928_0005_m_000000_0/output/spill437.out
> .index did not match expected owner 'mathias.walter'
>        at
> org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:182)
>        at
> org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:108)
>        at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:62)
>        at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:55)
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1480)
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1172)
>        at
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:574)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:641)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>        at org.apache.hadoop.mapred.Child.main(Child.java:211)
>
> A total of 8 tasks are running in parallel. They are finished after about 8
> hours, but some of them (19) were crashed with the above
> exception.
>
> Why are so many tasks crashed, but some not?
>
> --
> Kind regards,
> Mathias
>
>



Re: IOException: Owner 'mapred' for path XY not match expected owner 'AB'

Posted by Patrick Angeles <pa...@cloudera.com>.
Hi Mathias,

Best I can guess, you have uneven permissions on some of your
mapred.local.dir, causing tasks that run using those directories to fail.
See if these are all owned by user:group 'mapred:hadoop', and have
drwxr-xr-x permissions.
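
A quick way to spot an odd one out is to walk each mapred.local.dir with 
the local FileSystem and print who owns it, something like the following 
rough sketch (the default path below is only a placeholder for whatever 
your mapred.local.dir is actually set to):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalDirOwnerCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.getLocal(conf);
        // mapred.local.dir is a comma-separated list of TaskTracker scratch dirs
        for (String dir : conf.getStrings("mapred.local.dir", "/tmp/mapred/local")) {
          FileStatus stat = localFs.getFileStatus(new Path(dir));
          System.out.println(dir + " -> " + stat.getOwner() + ":" + stat.getGroup()
              + " " + stat.getPermission());  // expect mapred:hadoop rwxr-xr-x
        }
      }
    }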

Regards,

- Patrick
On Tue, Oct 26, 2010 at 3:34 AM, Mathias Walter <ma...@gmx.net>wrote:

> Hi Guys,
>
> recently I upgraded to the recent Cloudera Hadoop distribution. It contains
> hadoop-core-0.20.2+737.jar. If I now run my map job, I
> get the following exception for a few tasks:
>
> java.io.IOException: Owner 'mapred' for path
>
> /hadoop/hdfs5/tmp/taskTracker/mathias.walter/jobcache/job_201010210928_0005/attempt_201010210928_0005_m_000000_0/output/spill437.out
> .index did not match expected owner 'mathias.walter'
>        at
> org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:182)
>        at
> org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:108)
>        at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:62)
>        at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:55)
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1480)
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1172)
>        at
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:574)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:641)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>        at org.apache.hadoop.mapred.Child.main(Child.java:211)
>
> A total of 8 tasks are running in parallel. They are finished after about 8
> hours, but some of them (19) were crashed with the above
> exception.
>
> Why are so many tasks crashed, but some not?
>
> --
> Kind regards,
> Mathias
>
>

Re: Hadoop Compression - Current Status

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Steve,

Owen, can you elaborate a little on the effort for the ASL friendly codec
> that you mentioned?
>

See the work on FastLZ at https://issues.apache.org/jira/browse/HADOOP-6349.


Regards,
Jeff

RE: Hadoop Compression - Current Status

Posted by Stephen Watt <sw...@us.ibm.com>.
Mike, certain compression formats like LZO CAN be split, because they use 
a block compression algorithm. The InputFormat for the LZO file type can 
determine where the block markers are inside the compressed file, which is 
why LZO is so popular. As my original note states, Gzip and Bzip2 
previously did not have InputFormats able to do the same (i.e. one map per 
file - not optimal, as you state), but with the two new JIRAs I point to, 
from Hadoop 0.21 onwards those compression types will have InputFormats 
that can split files at specific block markers, in a similar manner to 
what is presently done with LZO.
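
To make the LZO side concrete, pointing a job at .lzo input is roughly the 
following (a sketch, assuming the InputFormat class name in the current 
hadoop-lzo tree; the files also need an index built with that project's 
LzoIndexer before they will actually be split):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    import com.hadoop.mapreduce.LzoTextInputFormat;

    public class LzoInputSetup {
      /** Read pre-indexed .lzo files so each compressed block can become a split. */
      public static void useLzoInput(Job job, Path input) throws IOException {
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, input);
      }
    }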

Owen, can you elaborate a little on the effort for the ASL-friendly codec 
that you mentioned?

Kind regards
Steve Watt



From: "Segel, Mike" <ms...@navteq.com>
To: "common-dev@hadoop.apache.org" <co...@hadoop.apache.org>
Date: 07/14/2010 10:30 AM
Subject: RE: Hadoop Compression - Current Status



Sorry for the delay in responding back...

Yes, that's kind of my point. 

You gain some efficiency, however... currently you have an expense of 
losing your parallelism which really gives you more bang for your buck.

I'm not sure what I can say about stuff going on at my current client, but 
I can say the following...

We're storing records in HBase using a SHA-1 hash as the record key. So 
we're getting good distribution across the cloud when the tables get 
large.

So suppose we're running a job where we want to run a process that 
accesses 100K records.
If the table only contains those 100K records, we have fewer region 
servers so we have fewer splits.
If the table contains 15 million rows, and we still want to only process 
those 100K records, we'll get more splits, and better utilization of the 
cloud.

Granted that this is HBase and not strictly hadoop, but the point remains 
the same. You become more efficient through parallelism and when you 
restrict your ability to run m/r tasks in parallel, your overall time is 
constrained.

So until you get MAPREDUCE-491 or the hadoop-lzo input formats, I think 
Stephen's assertion is incorrect.

Now while this is a bit of a nit, because Stephen seems to be concerned 
about a 'poisoned GPL', his comment about performance is incorrect.

It seems your performance is going to be better not using something that 
restricts your # of m/r tasks.

-Mike


-----Original Message-----
From: patrickangeles@gmail.com [mailto:patrickangeles@gmail.com] On Behalf 
Of Patrick Angeles
Sent: Monday, July 12, 2010 2:13 PM
To: common-dev@hadoop.apache.org
Subject: Re: Hadoop Compression - Current Status

Also, fwiw, the use of codecs and SequenceFiles are somewhat orthogonal.
You'll have to compress the sequencefile with a codec, be it gzip, bz2 or
lzo. SequenceFiles do get you splittability which you won't get with just
Gzip (until we get MAPREDUCE-491) or the hadoop-lzo InputFormats.

cheers,

- Patrick

On Mon, Jul 12, 2010 at 2:42 PM, Segel, Mike <ms...@navteq.com> wrote:

> How can you say zip files are 'best codecs' to use?
>
> Call me silly but I seem to recall that if you're using a zip'd file for
> input you can't really use a file splitter?
> (Going from memory, which isn't the best thing to do...)
>
> -Mike
>
>
> -----Original Message-----
> From: Stephen Watt [mailto:swatt@us.ibm.com]
> Sent: Monday, July 12, 2010 1:28 PM
> To: common-dev@hadoop.apache.org
> Subject: Hadoop Compression - Current Status
>
> Please let me know if any of assertions are incorrect. I'm going to be
> adding any feedback to the Hadoop Wiki. It seems well documented that the
> LZO Codec is the most performant codec (
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html)
> but it is GPL infected and thus it is separately maintained here -
> http://github.com/kevinweil/hadoop-lzo.
>
> With regards to performance, and if you are not using sequential files,
> Gzip is the next best codec to use, followed by bzip2. Hadoop has
> supported being able to process bzip2 and gzip input formats for awhile
> now but it could never split the files. i.e. it assigned one mapper per
> file. There are now 2 new features :
> - Splitting bzip2 files available in 0.21.0 -
> https://issues.apache.org/jira/browse/HADOOP-4012
> - Splitting gzip files (in progress but patch available) -
> https://issues.apache.org/jira/browse/MAPREDUCE-491
>
> 1) It appears most folks are using LZO. Given that it is GPL, are you not
> worried about it virally infecting your project ?
> 2) Is anyone using the new bzip2 or gzip file split compatible readers?
> How do you like them? General feedback?
>
> Kind regards
> Steve Watt
>





RE: Hadoop Compression - Current Status

Posted by "Segel, Mike" <ms...@navteq.com>.
Sorry for the delay in responding back...

Yes, that's kind of my point. 

You gain some efficiency, but currently at the expense of losing your parallelism, which is what really gives you more bang for your buck.

I'm not sure what I can say about stuff going on at my current client, but I can say the following...

We're storing records in HBase using a SHA-1 hash as the record key. So we're getting good distribution across the cloud when the tables get large.

So suppose we're running a job where we want to run a process that accesses 100K records.
If the table only contains those 100K records, we have fewer region servers so we have fewer splits.
If the table contains 15 million rows, and we still want to only process those 100K records, we'll get more splits, and better utilization of the cloud.

Granted, this is HBase and not strictly Hadoop, but the point remains the same. You become more efficient through parallelism, and when you restrict your ability to run m/r tasks in parallel, your overall time is constrained.

So until you get MAPREDUCE-491 or the hadoop-lzo input formats, I think Stephen's assertion is incorrect.

Now, while this is a bit of a nit (Stephen seems mainly concerned about a 'poisoned GPL'), his comment about performance is incorrect.

It seems your performance is going to be better not using something that restricts your # of m/r tasks.

-Mike


-----Original Message-----
From: patrickangeles@gmail.com [mailto:patrickangeles@gmail.com] On Behalf Of Patrick Angeles
Sent: Monday, July 12, 2010 2:13 PM
To: common-dev@hadoop.apache.org
Subject: Re: Hadoop Compression - Current Status

Also, fwiw, the use of codecs and SequenceFiles are somewhat orthogonal.
You'll have to compress the sequencefile with a codec, be it gzip, bz2 or
lzo. SequenceFiles do get you splittability which you won't get with just
Gzip (until we get MAPREDUCE-491) or the hadoop-lzo InputFormats.

cheers,

- Patrick

On Mon, Jul 12, 2010 at 2:42 PM, Segel, Mike <ms...@navteq.com> wrote:

> How can you say zip files are 'best codecs' to use?
>
> Call me silly but I seem to recall that if you're using a zip'd file for
> input you can't really use a file splitter?
> (Going from memory, which isn't the best thing to do...)
>
> -Mike
>
>
> -----Original Message-----
> From: Stephen Watt [mailto:swatt@us.ibm.com]
> Sent: Monday, July 12, 2010 1:28 PM
> To: common-dev@hadoop.apache.org
> Subject: Hadoop Compression - Current Status
>
> Please let me know if any of assertions are incorrect. I'm going to be
> adding any feedback to the Hadoop Wiki. It seems well documented that the
> LZO Codec is the most performant codec (
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html)
> but it is GPL infected and thus it is separately maintained here -
> http://github.com/kevinweil/hadoop-lzo.
>
> With regards to performance, and if you are not using sequential files,
> Gzip is the next best codec to use, followed by bzip2. Hadoop has
> supported being able to process bzip2 and gzip input formats for awhile
> now but it could never split the files. i.e. it assigned one mapper per
> file. There are now 2 new features :
> - Splitting bzip2 files available in 0.21.0 -
> https://issues.apache.org/jira/browse/HADOOP-4012
> - Splitting gzip files (in progress but patch available) -
> https://issues.apache.org/jira/browse/MAPREDUCE-491
>
> 1) It appears most folks are using LZO. Given that it is GPL, are you not
> worried about it virally infecting your project ?
> 2) Is anyone using the new bzip2 or gzip file split compatible readers?
> How do you like them? General feedback?
>
> Kind regards
> Steve Watt
>
>
>



Re: Hadoop Compression - Current Status

Posted by Patrick Angeles <pa...@cloudera.com>.
Also, fwiw, the use of codecs and SequenceFiles is somewhat orthogonal.
You'll have to compress the SequenceFile with a codec, be it gzip, bz2 or
lzo. SequenceFiles do get you splittability which you won't get with just
Gzip (until we get MAPREDUCE-491) or the hadoop-lzo InputFormats.
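
As a rough illustration of the combination, writing a block-compressed
SequenceFile with the gzip codec looks something like this (the output
path and key/value types are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SeqFileWriteDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/demo.seq");  // placeholder output path
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // BLOCK compression compresses batches of records with the chosen codec;
        // the file stays splittable when read back with SequenceFileInputFormat.
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
            LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, codec);
        try {
          writer.append(new LongWritable(1), new Text("hello"));
        } finally {
          writer.close();
        }
      }
    }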

cheers,

- Patrick

On Mon, Jul 12, 2010 at 2:42 PM, Segel, Mike <ms...@navteq.com> wrote:

> How can you say zip files are 'best codecs' to use?
>
> Call me silly but I seem to recall that if you're using a zip'd file for
> input you can't really use a file splitter?
> (Going from memory, which isn't the best thing to do...)
>
> -Mike
>
>
> -----Original Message-----
> From: Stephen Watt [mailto:swatt@us.ibm.com]
> Sent: Monday, July 12, 2010 1:28 PM
> To: common-dev@hadoop.apache.org
> Subject: Hadoop Compression - Current Status
>
> Please let me know if any of assertions are incorrect. I'm going to be
> adding any feedback to the Hadoop Wiki. It seems well documented that the
> LZO Codec is the most performant codec (
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html)
> but it is GPL infected and thus it is separately maintained here -
> http://github.com/kevinweil/hadoop-lzo.
>
> With regards to performance, and if you are not using sequential files,
> Gzip is the next best codec to use, followed by bzip2. Hadoop has
> supported being able to process bzip2 and gzip input formats for awhile
> now but it could never split the files. i.e. it assigned one mapper per
> file. There are now 2 new features :
> - Splitting bzip2 files available in 0.21.0 -
> https://issues.apache.org/jira/browse/HADOOP-4012
> - Splitting gzip files (in progress but patch available) -
> https://issues.apache.org/jira/browse/MAPREDUCE-491
>
> 1) It appears most folks are using LZO. Given that it is GPL, are you not
> worried about it virally infecting your project ?
> 2) Is anyone using the new bzip2 or gzip file split compatible readers?
> How do you like them? General feedback?
>
> Kind regards
> Steve Watt
>
>
>

RE: Hadoop Compression - Current Status

Posted by "Segel, Mike" <ms...@navteq.com>.
How can you say zip files are 'best codecs' to use?

Call me silly but I seem to recall that if you're using a zip'd file for input you can't really use a file splitter?
(Going from memory, which isn't the best thing to do...)

-Mike


-----Original Message-----
From: Stephen Watt [mailto:swatt@us.ibm.com] 
Sent: Monday, July 12, 2010 1:28 PM
To: common-dev@hadoop.apache.org
Subject: Hadoop Compression - Current Status

Please let me know if any of assertions are incorrect. I'm going to be 
adding any feedback to the Hadoop Wiki. It seems well documented that the 
LZO Codec is the most performant codec (
http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html) 
but it is GPL infected and thus it is separately maintained here - 
http://github.com/kevinweil/hadoop-lzo. 

With regards to performance, and if you are not using sequential files, 
Gzip is the next best codec to use, followed by bzip2. Hadoop has 
supported being able to process bzip2 and gzip input formats for awhile 
now but it could never split the files. i.e. it assigned one mapper per 
file. There are now 2 new features :
- Splitting bzip2 files available in 0.21.0 - 
https://issues.apache.org/jira/browse/HADOOP-4012
- Splitting gzip files (in progress but patch available) - 
https://issues.apache.org/jira/browse/MAPREDUCE-491

1) It appears most folks are using LZO. Given that it is GPL, are you not 
worried about it virally infecting your project ?
2) Is anyone using the new bzip2 or gzip file split compatible readers? 
How do you like them? General feedback?

Kind regards
Steve Watt



Re: Hadoop Compression - Current Status

Posted by Greg Roelofs <ro...@yahoo-inc.com>.
Stephen Watt <sw...@us.ibm.com> wrote:

> Please let me know if any of assertions are incorrect. I'm going to be 
> adding any feedback to the Hadoop Wiki. It seems well documented that the 
> LZO Codec is the most performant codec (
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html) 

Speedwise, yes.

> 1) It appears most folks are using LZO. Given that it is GPL, are you not 
> worried about it virally infecting your project ?

"Viral infections" are more of a Windows concept, not so much source code or
licenses.

There are literally _piles_ of information available on this, and you really
should go read up on it.  But the upshot is that (1) GPLv2 triggers for
distribution, not use, and (2) even if you're distributing in violation of
the license, the worst that can happen is that you lose all privileges with
respect to the GPL'd code and perhaps have to pay damages for copyright
infringement.  It can't "infect" your own code, but to the extent that the
combined work is legally considered a derived work, you can be barred from
distributing the combination.

Greg