Posted to user@pig.apache.org by Dan Feldman <hr...@gmail.com> on 2012/03/29 20:24:46 UTC

Pig not storing/loading Cassandra data properly

Hi,

I'm loading a bunch of data into Pig using CassandraStorage. When I do a
dump and/or store, the output contains only 2-3% of the data in the
Cassandra database.

My Cassandra data consists of (for now) 4-5 wide rows where each data entry
is a super column ordered by TimeUUID.

My script currently looks like this:

rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage() AS
(key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name,
value)})});
store rows into 'directory/test';

The output that I get when I run the script looks like this (I highlighted
the warnings):

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
*2012-03-29 11:10:58,063 [main] INFO  org.apache.pig.Main - Logging error
messages to: /directory/pig_1333044658058.log
2012-03-29 11:10:58,105 [main] INFO
org.apache.pig.tools.parameters.PreprocessorContext - Executing command :
date "+%y%m%d%H%M%S"
2012-03-29 11:10:58,268 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to hadoop file system at: file:///
2012-03-29 11:10:59,018 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: UNKNOWN
2012-03-29 11:10:59,182 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
File concatenation threshold: 100 optimistic? false
2012-03-29 11:10:59,211 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
2012-03-29 11:10:59,211 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
2012-03-29 11:10:59,251 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added
to the job
2012-03-29 11:10:59,269 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-03-29 11:10:59,292 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2012-03-29 11:10:59,334 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map-reduce job(s) waiting for submission.
2012-03-29 11:10:59,361 [Thread-1] WARN  org.apache.hadoop.mapred.JobClient
- No job jar file set.  User classes may not be found. See JobConf(Class)
or JobConf#setJar(String).
2012-03-29 11:10:59,437 [Thread-1] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths (combined) to process : 1
2012-03-29 11:10:59,836 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_local_0001
2012-03-29 11:10:59,836 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
2012-03-29 11:11:01,185 [Thread-2] INFO  org.apache.hadoop.mapred.Task -
Task:attempt_local_0001_m_000000_0 is done. And is in the process of
commiting
2012-03-29 11:11:01,189 [Thread-2] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2012-03-29 11:11:01,189 [Thread-2] INFO  org.apache.hadoop.mapred.Task -
Task attempt_local_0001_m_000000_0 is allowed to commit now
2012-03-29 11:11:01,192 [Thread-2] INFO
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output
of task 'attempt_local_0001_m_000000_0' to file:/root/root/test
2012-03-29 11:11:02,714 [Thread-2] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2012-03-29 11:11:02,714 [Thread-2] INFO  org.apache.hadoop.mapred.Task -
Task 'attempt_local_0001_m_000000_0' done.
2012-03-29 11:11:04,842 [main] WARN
org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for
job job_local_0001
2012-03-29 11:11:04,845 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete
2012-03-29 11:11:04,845 [main] INFO
org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats
reported below may be incomplete
2012-03-29 11:11:04,847 [main] INFO
org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
0.20.203.0    0.9.3-SNAPSHOT    root    2012-03-29 11:10:59    2012-03-29 11:11:04    UNKNOWN

Success!

Job Stats (time in seconds):
JobId    Alias    Feature    Outputs
job_local_0001    rows    MAP_ONLY    file:///root/directory/test,

Input(s):
Successfully read records from: "cassandra://Keyspace/ColumnFamily"

Output(s):
Successfully stored records in: "file:///root/directory/test"

Job DAG:
job_local_0001


2012-03-29 11:11:04,849 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Success!*

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


I don't know whether it's related to the problem, but I recently noticed
that ILLUSTRATE dumps the data to the terminal before actually illustrating
the schema. It outputs the same amount of data (about 2-3% of the total) as
DUMP or STORE does.

I'm using Pig 0.9.3 in local mode with Cassandra 1.0.8.


P.S. I tried setting -Dpig.splitCombination=false as was suggested by Matt
in
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Pig-not-reading-all-cassandra-data-td5982283.html,
but it didn't help...


Thanks for your help!
Dan F.

Re: Pig not storing/loading Cassandra data properly

Posted by Dan Feldman <hr...@gmail.com>.

Silly me,

CassandraStorage() limits the number of columns it retrieves to 1024 by
default. If you need more, you can raise the limit in the Pig script, for
example:
rows = LOAD 'cassandra://KS/CF?limit=12345' USING CassandraStorage();
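
For reference, here is a minimal sketch of the original script with the limit
applied. The keyspace, column family, and output directory are the placeholder
names from the first message, and 12345 is only an illustrative value, not a
recommendation. As far as I can tell the limit applies to each row, so it
should be at least as large as the widest row.

```pig
-- Sketch only: 'Keyspace', 'ColumnFamily' and 'directory/test' are the
-- placeholder names from the original script; 12345 is an arbitrary value.
-- Assumption: the limit caps the columns read per row, so it should be at
-- least as large as the widest row.
rows = LOAD 'cassandra://Keyspace/ColumnFamily?limit=12345' USING CassandraStorage() AS
    (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name, value)})});
STORE rows INTO 'directory/test';
```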

Does anyone know who is in charge of developing CassandraStorage (or at least
who wrote the original one)? I will probably have more questions about it,
but I wouldn't want to spam this list with them (since it seems like
CassandraStorage isn't widely used in the Pig community).

Thanks,
Dan F.


Re: Pig not storing/loading Cassandra data properly

Posted by Dan Feldman <hr...@gmail.com>.

Still no success in figuring out how to make Pig load and/or store all of
the data in the Cassandra db. Maybe I just need to tweak some Pig config
settings somewhere? But I don't really know where (the files in conf/ don't
seem to be doing anything, and I can't really find other info online). I've
included the messages I get when running. Just to restate the problem: when I
load and then immediately dump/store data from a Cassandra row, I get about
1500 columns while there are about 7500 of them in the actual db.
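
One way to see the truncation directly is to count the columns Pig actually
receives for each row. This is only a sketch: it reuses the schema and
placeholder names from the original script, and col_counts and num_columns
are names made up for illustration.

```pig
-- Sketch only: keyspace and column family are placeholders, and the schema
-- is the one from the original script.
rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage() AS
    (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name, value)})});
-- COUNT on a bag counts its tuples (those whose first field is non-null),
-- which here means one per super column in the row.
col_counts = FOREACH rows GENERATE key, COUNT(columns) AS num_columns;
DUMP col_counts;
```

Comparing these per-row counts with what Cassandra itself reports for the same
keys should show whether whole rows are missing or whether each row is being
cut short.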

Thanks!

PIG MESSAGES: (this is for a slightly different script, but the problem is
the same)

2012-04-06 17:27:14,689 [main] INFO
org.apache.pig.tools.parameters.PreprocessorContext - Executing command :
date "+%y%m%d%H%M%S"
2012-04-06 17:27:14,907 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to hadoop file system at: hdfs://localhost:9000
2012-04-06 17:27:15,252 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to map-reduce job tracker at: localhost:9001
2012-04-06 17:27:16,061 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: FILTER
2012-04-06 17:27:16,287 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
File concatenation threshold: 100 optimistic? false
2012-04-06 17:27:16,321 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
2012-04-06 17:27:16,321 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
2012-04-06 17:27:16,412 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added
to the job
2012-04-06 17:27:16,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-04-06 17:27:16,430 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- creating jar file Job3074512946558271384.jar
2012-04-06 17:27:25,849 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- jar file Job3074512946558271384.jar created
2012-04-06 17:27:25,869 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2012-04-06 17:27:25,928 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map-reduce job(s) waiting for submission.
2012-04-06 17:27:26,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
2012-04-06 17:27:26,588 [Thread-4] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths (combined) to process : 1
2012-04-06 17:27:27,371 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_201204061725_0001
2012-04-06 17:27:27,371 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- More information at:
http://localhost:50030/jobdetails.jsp?jobid=job_201204061725_0001
2012-04-06 17:27:46,992 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 50% complete
2012-04-06 17:27:57,079 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete
2012-04-06 17:27:57,081 [main] INFO
org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
1.0.1    0.9.3-SNAPSHOT    root    2012-04-06 17:27:16    2012-04-06 17:27:57    FILTER

Success!

Job Stats (time in seconds):
JobId    Maps    Reduces    MaxMapTime    MinMapTIme    AvgMapTime    MaxReduceTime    MinReduceTime    AvgReduceTime    Alias    Feature    Outputs
job_201204061725_0001    1    0    9    9    9    0    0    0    cols,filtered,filtered_rows,rows,super_cols    MAP_ONLY    hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144,

Input(s):
Successfully read 4 records (397 bytes) from:
"cassandra://KeySpace/ColumnFamily"

Output(s):
Successfully stored 271 records (31076 bytes) in:
"hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144"

Counters:
Total records written : 271
Total bytes written : 31076
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201204061725_0001


2012-04-06 17:27:57,092 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Success!
****hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144
2012-04-06 17:27:57,109 [main] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
to process : 1
2012-04-06 17:27:57,109 [main] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths to process : 1
...


On Mon, Apr 2, 2012 at 2:11 PM, Dan Feldman <hr...@gmail.com> wrote:

> Managed to get MR mode running by specifying HADOOP_CLASSPATH in
> $HADOOP_HOME/conf/hadoop-env.sh and restarting Hadoop after that.
>
> In any case, it seems that Pig continues to misbehave in both MR and local
> modes: counting rows produces 2 results while we know there are 5 of them,
> ILLUSTRATE dumps recent data to grunt, and STORE saves only some subset
> of recent data.
>
>
> On Mon, Apr 2, 2012 at 9:54 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>
>> Looks like you don't have Thrift on your classpath, or the wrong
>> version of thrift.
>>
>> Pig may be doing something weird with splits in local mode. It would
>> be great if you could determine whether (assuming you fix the
>> classpath) the problem happens in local mode only, or both in local
>> and MR modes.
>>
>> D
>>
>> On Sun, Apr 1, 2012 at 9:49 PM, Dan Feldman <hr...@gmail.com> wrote:
>> > Hi Dmitriy,
>> >
>> > Apologies for the delay - our server was misbehaving so it took a while to
>> > get everything set up on a new one. In any case, we basically cloned
>> > Cassandra from the old one to the new one - running Pig in local mode still
>> > produces wrong number of results. Now, we never ran the scripts in MR mode,
>> > so I don't know whether this is related to the original problem or not, but
>> > this is the error I get when running on top of hadoop:
>> >
>> >
>> > ===============================================================================
>> > ....
>> > *2012-04-01 21:32:38,781 [main] INFO
>> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> > - job job_201203301228_0003 has failed! Stop running all dependent jobs
>> > 2012-04-01 21:32:38,782 [main] INFO
>> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> > - 100% complete
>> > 2012-04-01 21:32:38,790 [main] ERROR
>> > org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to
>> > recreate exception from backed error: Error:
>> > java.lang.ClassNotFoundException: org.apache.thrift.TException
>> > 2012-04-01 21:32:38,790 [main] ERROR
>> > org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
>> > 2012-04-01 21:32:38,791 [main] INFO
>> > org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics*:
>> > ....
>> >
>> > ================================================================================
>> >
>> > Thanks,
>> > Dan F
>> >
>> >
>> > On Thu, Mar 29, 2012 at 6:20 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>> >
>> >> What happens when you run in MR mode instead of local mode?
>>
>
>

Re: Pig not storing/loading Cassandra data properly

Posted by Dan Feldman <hr...@gmail.com>.

Managed to get MR mode running by specifying HADOOP_CLASSPATH in
$HADOOP_HOME/conf/hadoop-env.sh and restarting Hadoop after that.

In any case, it seems that Pig continues to misbehave in both MR and local
modes: counting rows produces 2 results while we know there are 5 of them,
ILLUSTRATE dumps recent data to grunt, and STORE saves only some subset
of recent data.

On Mon, Apr 2, 2012 at 9:54 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Looks like you don't have Thrift on your classpath, or the wrong
> version of thrift.
>
> Pig may be doing something weird with splits in local mode. It would
> be great if you could determine whether (assuming you fix the
> classpath) the problem happens in local mode only, or both in local
> and MR modes.
>
> D
>
> On Sun, Apr 1, 2012 at 9:49 PM, Dan Feldman <hr...@gmail.com> wrote:
> > Hi Dmitriy,
> >
> > Apologies for the delay - our server was misbehaving so it took a while to
> > get everything set up on a new one. In any case, we basically cloned
> > Cassandra from the old one to the new one - running Pig in local mode still
> > produces wrong number of results. Now, we never ran the scripts in MR mode,
> > so I don't know whether this is related to the original problem or not, but
> > this is the error I get when running on top of hadoop:
> >
> >
> > ===============================================================================
> > ....
> > *2012-04-01 21:32:38,781 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - job job_201203301228_0003 has failed! Stop running all dependent jobs
> > 2012-04-01 21:32:38,782 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - 100% complete
> > 2012-04-01 21:32:38,790 [main] ERROR
> > org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to
> > recreate exception from backed error: Error:
> > java.lang.ClassNotFoundException: org.apache.thrift.TException
> > 2012-04-01 21:32:38,790 [main] ERROR
> > org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> > 2012-04-01 21:32:38,791 [main] INFO
> > org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics*:
> > ....
> >
> > ================================================================================
> >
> > Thanks,
> > Dan F
> >
> >
> > On Thu, Mar 29, 2012 at 6:20 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> >
> >> What happens when you run in MR mode instead of local mode?
> >>
> >> On Thu, Mar 29, 2012 at 11:24 AM, Dan Feldman <hr...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >

Re: Pig not storing/loading Cassandra data properly

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Looks like you don't have Thrift on your classpath, or you have the
wrong version of Thrift.

Pig may be doing something weird with splits in local mode. It would
be great if you could determine whether (assuming you fix the
classpath) the problem happens in local mode only, or both in local
and MR modes.
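
One way to test the classpath theory is to look for the missing class
(org.apache.thrift.TException, from the error quoted below) in the jars
that get handed to Pig. A minimal sketch in Python, assuming the jars sit
in plain directories; the function name jars_containing is made up for
illustration and is not part of Pig or Cassandra:

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, jar_dirs):
    """Return the jars found directly under jar_dirs that contain class_name."""
    # A class a.b.C is stored inside a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar_dir in jar_dirs:
        for jar in sorted(Path(jar_dir).glob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if entry in zf.namelist():
                        hits.append(str(jar))
            except zipfile.BadZipFile:
                # Skip files named *.jar that are not valid archives.
                continue
    return hits
```

Calling jars_containing("org.apache.thrift.TException", [...]) with the
directories you put on Pig's classpath should list at least one jar. An
empty result would point to a missing Thrift jar, and more than one hit
may mean conflicting versions.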

D

On Sun, Apr 1, 2012 at 9:49 PM, Dan Feldman <hr...@gmail.com> wrote:
> Hi Dmitriy,
>
> Apologies for the delay - our server was misbehaving so it took a while to
> get everything set up on a new one. In any case, we basically cloned
> Cassandra from the old one to the new one - running Pig in local mode still
> produces the wrong number of results. Now, we never ran the scripts in MR mode,
> so I don't know whether this is related to the original problem or not, but
> this is the error I get when running on top of Hadoop:
>
> ===============================================================================
> ....
> *2012-04-01 21:32:38,781 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - job job_201203301228_0003 has failed! Stop running all dependent jobs
> 2012-04-01 21:32:38,782 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 100% complete
> 2012-04-01 21:32:38,790 [main] ERROR
> org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to
> recreate exception from backed error: Error:
> java.lang.ClassNotFoundException: org.apache.thrift.TException
> 2012-04-01 21:32:38,790 [main] ERROR
> org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> 2012-04-01 21:32:38,791 [main] INFO
> org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics*:
> ....
> ================================================================================
>
> Thanks,
> Dan F
>
>
> On Thu, Mar 29, 2012 at 6:20 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>
>> What happens when you run in MR mode instead of local mode?

Re: Pig not storing/loading Cassandra data properly

Posted by Dan Feldman <hr...@gmail.com>.

Hi Dmitriy,

Apologies for the delay - our server was misbehaving so it took a while to
get everything set up on a new one. In any case, we basically cloned
Cassandra from the old one to the new one - running Pig in local mode still
produces the wrong number of results. Now, we never ran the scripts in MR mode,
so I don't know whether this is related to the original problem or not, but
this is the error I get when running on top of Hadoop:

===============================================================================
....
*2012-04-01 21:32:38,781 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- job job_201203301228_0003 has failed! Stop running all dependent jobs
2012-04-01 21:32:38,782 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete
2012-04-01 21:32:38,790 [main] ERROR
org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to
recreate exception from backed error: Error:
java.lang.ClassNotFoundException: org.apache.thrift.TException
2012-04-01 21:32:38,790 [main] ERROR
org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2012-04-01 21:32:38,791 [main] INFO
org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics*:
....
================================================================================
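
Because each log record above wraps across several lines, the WARN and
ERROR entries are easy to miss. A rough sketch in Python that regroups
the wrapped lines and keeps only those records; it assumes unquoted log
text in the timestamp format shown above, and the function names
group_records and problems are made up for illustration:

```python
import re

# A record starts with a timestamp such as "2012-04-01 21:32:38,790",
# optionally preceded by the "*" used for highlighting in this mail.
TIMESTAMP = re.compile(r"^\*?\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\b")
# The level follows the bracketed thread name, e.g. "[main] ERROR".
LEVEL = re.compile(r"\[[^\]]+\]\s+(INFO|WARN|ERROR)\b")


def group_records(lines):
    """Join wrapped log lines into one string per record."""
    records, current = [], None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            # A new timestamp closes the previous record and opens a new one.
            if current is not None:
                records.append(" ".join(current))
            current = [line.lstrip("*")]
        elif current is not None and any(ch.isalnum() for ch in line):
            # Continuation of a wrapped record.
            current.append(line)
        elif current is not None:
            # Blank lines and rules such as "====" end the record.
            records.append(" ".join(current))
            current = None
    if current is not None:
        records.append(" ".join(current))
    return records


def problems(lines):
    """Return only the WARN and ERROR records."""
    keep = []
    for record in group_records(lines):
        match = LEVEL.search(record)
        if match and match.group(1) in ("WARN", "ERROR"):
            keep.append(record)
    return keep
```

Passing the lines of a saved log to problems() returns each WARN or
ERROR record as a single string, so the ClassNotFoundException ends up
on the same line as the ERROR 2997 message that reports it.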

Thanks,
Dan F


On Thu, Mar 29, 2012 at 6:20 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> What happens when you run in MR mode instead of local mode?

Re: Pig not storing/loading Cassandra data properly

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

What happens when you run in MR mode instead of local mode?
