Posted to commits@cassandra.apache.org by "Robbie Strickland (JIRA)" <ji...@apache.org> on 2012/05/01 18:58:50 UTC

[jira] [Created] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Robbie Strickland created CASSANDRA-4208:
--------------------------------------------

             Summary: ColumnFamilyOutputFormat should support writing to multiple column families
                 Key: CASSANDRA-4208
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
             Project: Cassandra
          Issue Type: Improvement
          Components: Hadoop
    Affects Versions: 1.1.0
            Reporter: Robbie Strickland


It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (e.g. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267524#comment-13267524 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

@Jonathan: Yes that is the patch, although the Hadoop patch is not required as long as you have the latest in trunk.  The Hadoop patch just moves the call to set the base name out of FileOutputFormat and into OutputFormat--as a matter of principle and to avoid potential future issues.

@Jake: Yes it is different. I examined prior branches to see where the changes were made, and it's only in trunk--which is why I didn't see it until checking out trunk to make the changes.  

It probably makes sense to do a patch against Hadoop 1.0.2 and Cassandra 1.1 so people can use a release version.  This is definitely doable without significant effort.
                

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265911#comment-13265911 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

There is an API change, so when you do a context.write(), the signature now takes in a Pair<String, ByteBuffer> instead of just a ByteBuffer.  I also changed ConfigHelper.setOutputColumnFamily() to setOutputKeyspace() and removed CF-related checks and config keys.  It broke my existing reducers, but it's also an easy fix and adds tremendous value IMHO.
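To illustrate the shape of this change, here is a minimal sketch that uses stand-in types only. CfKey and MultiCfWriter are hypothetical names, not classes from the patch; CfKey plays the role of the Pair<String, ByteBuffer> key. The point is that each write carries its own column family, so a single reducer can target several of them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Pair<String, ByteBuffer> key described above.
final class CfKey {
    final String columnFamily;
    final ByteBuffer rowKey;

    CfKey(String columnFamily, ByteBuffer rowKey) {
        this.columnFamily = columnFamily;
        this.rowKey = rowKey;
    }
}

// Hypothetical writer: groups values by the column family named in each key,
// instead of reading one column family from the job configuration.
final class MultiCfWriter {
    private final Map<String, List<String>> buffered = new LinkedHashMap<>();

    void write(CfKey key, String value) {
        // duplicate() leaves the caller's buffer position untouched.
        String row = StandardCharsets.UTF_8.decode(key.rowKey.duplicate()).toString();
        buffered.computeIfAbsent(key.columnFamily, cf -> new ArrayList<>())
                .add(row + "=" + value);
    }

    Map<String, List<String>> buffered() {
        return buffered;
    }
}
```

A reducer written against this shape would call write() once for the data column family and once for the index column family, reusing the same row key buffer.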
                

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489225#comment-13489225 ] 

Jonathan Ellis commented on CASSANDRA-4208:
-------------------------------------------

Reverted the BOF change in 78d6f64f33c592890051c690ddf5d26b7b2af027
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460763#comment-13460763 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

You mean BulkOutputFormat isn't working, or MultipleOutputs isn't working at all?
                

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265931#comment-13265931 ] 

Jonathan Ellis commented on CASSANDRA-4208:
-------------------------------------------

Are you familiar with the Hadoop MultipleOutputs api?  Seems like that's the "right" way to do this.
                

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460797#comment-13460797 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

[~mkjellman] your usage is correct.  What this patch does is actually change the ConfigHelper so set/getColumnFamily() operates on the mapreduce.output.basename key that MultipleOutputs (and FileInput/OutputFormat) uses when it's looking for outputs.  This is a bit hacky but unavoidable since methods to alter this through the Hadoop API are inaccessible.  I have a related ticket on the Hadoop side to change this and make it more generic, but until then this will have to do. 
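As a rough sketch of the idea in this comment, assuming a plain Map stands in for the Hadoop Configuration: BasenameConfigSketch is a hypothetical class, not the real ConfigHelper, and only the key name mapreduce.output.basename comes from the comment itself. The column family getter and setter simply read and write that one key.

```java
import java.util.Map;

// Hypothetical stand-in for ConfigHelper; a Map replaces Hadoop's Configuration.
final class BasenameConfigSketch {
    // Key named in the comment above as the one MultipleOutputs uses
    // when it is looking for outputs.
    static final String OUTPUT_BASENAME_KEY = "mapreduce.output.basename";

    static void setOutputColumnFamily(Map<String, String> conf, String columnFamily) {
        conf.put(OUTPUT_BASENAME_KEY, columnFamily);
    }

    static String getOutputColumnFamily(Map<String, String> conf) {
        return conf.get(OUTPUT_BASENAME_KEY);
    }
}
```

Because both paths share one key, a value stored under the base name by other code is what the getter returns, which is the behavior the comment relies on.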
                

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460770#comment-13460770 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

Both ColumnFamilyOutputFormat and BulkOutputFormat. addNamedOutput never seems to set the column family.

I would assume:

ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);

is all that is needed. If I don't set up the job with job.setOutputFormatClass(ColumnFamilyOutputFormat.class), FileOutputFormat throws an exception:

Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)

If I do specify that at the job level, the job name never seems to set the column family name on that job.

Additionally, using the job name as the column family name is slightly inconvenient, as we use '_' in our column family names, which is not a valid character in MultipleOutputs; it looks like _# is how they internally keep track of counters if that is enabled.
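The naming restriction can be sketched as a small check. This assumes named outputs accept only ASCII letters and digits, which is consistent with the '_' rejection reported above but is not copied from the Hadoop source; NamedOutputNameCheck is a hypothetical helper.

```java
// Hypothetical helper illustrating the assumed naming rule for named outputs.
final class NamedOutputNameCheck {
    // Returns true only for non-empty names made of ASCII letters and digits.
    static boolean isValidNamedOutput(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (char ch : name.toCharArray()) {
            boolean alphanumeric = (ch >= 'A' && ch <= 'Z')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= '0' && ch <= '9');
            if (!alphanumeric) {
                return false;
            }
        }
        return true;
    }
}
```

Under that assumption a column family called user_index could not be used directly as a named output, while userindex could.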
                

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266626#comment-13266626 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

I spent a good bit of time analyzing the changes needed to make this work using MultipleOutputs, and it would involve:

1. Removing hard-coded references to WritableComparable and Writable in MultipleOutputs.getNamedOutputKeyClass() and getNamedOutputValueClass().
2. Removing hard-coded call to FileOutputFormat.setOutputName() in getRecordWriter().
3. Adding an abstract setOutputName() to OutputFormat so the call in #2 can be made generic. An alternative is a default no-op implementation so it doesn't break existing output formats that don't care about this.
4. Implementing setOutputName() in ColumnFamilyOutputFormat, which would set the config property for the CF (where the "name" corresponds to CF).
5. Separating CFOF.setColumnFamily() and setKeyspace(), where setColumnFamily() is just a pass-through to setOutputName() (or vice versa).

This solution would allow MultipleOutputs support in conformance with the existing API, and it should not break any existing reducer code.  I don't personally love the boilerplate it adds to my reducer, and I think it's much less obvious than handling it at the write() call, but I can get over that if I have to. :)  I am willing to do the work on both sides if this is where the consensus is, though I don't know what the response will be in the Hadoop community.

Thoughts?
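Steps 3 and 4 above, in the no-op variant, can be sketched as follows. All class names and the configuration key are made up for illustration, and a Map stands in for the Hadoop Configuration; this is not the actual Hadoop or Cassandra code.

```java
import java.util.Map;

// Hypothetical stand-in for the abstract OutputFormat base class.
abstract class SketchOutputFormat {
    // Default no-op, so formats that ignore output names need no changes.
    void setOutputName(Map<String, String> conf, String name) {
    }
}

// Hypothetical Cassandra-style format: the output name is the column family.
final class SketchColumnFamilyOutputFormat extends SketchOutputFormat {
    static final String CF_KEY = "sketch.output.columnfamily"; // made-up key

    @Override
    void setOutputName(Map<String, String> conf, String name) {
        conf.put(CF_KEY, name);
    }
}

// A format that does not care about names simply inherits the no-op.
final class SketchOtherOutputFormat extends SketchOutputFormat {
}
```

With this layout the caller can invoke setOutputName() on any format without knowing its concrete type, which is what would make the call in step 2 generic.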
                

[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robbie Strickland updated CASSANDRA-4208:
-----------------------------------------

    Attachment: trunk-4208-v3.txt

I've attached the new patch (v3) rebased against trunk.
                

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266685#comment-13266685 ] 

T Jake Luciani commented on CASSANDRA-4208:
-------------------------------------------

I would think the Hadoop community would go for it since they already do so much to decouple MR from HDFS.

Let's ping them and see what they think, otherwise we could go with the less portable solution.
                

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462802#comment-13462802 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

Not a problem.  I'll do so when I get back from Strange Loop... :)
                

[jira] [Issue Comment Edited] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271761#comment-13271761 ] 

T Jake Luciani edited comment on CASSANDRA-4208 at 5/9/12 8:20 PM:
-------------------------------------------------------------------

I'm ok with this now that it works with MultipleOutputs (nice find), though I'm not sure if it should be in 1.1 since it would break existing scripts.  Would you be able to make it backwards compatible by adding the old public static void setOutputColumnFamily(Configuration conf, String keyspace, String columnFamily) back and using the new setColumnFamily() in there?

                
      was (Author: tjake):
    I'm ok with this now that it works with MultipleOutputs (nice find), though I'm not sure if it should be in 1.1 since it would break existing scripts.  Would you be able to make it backwards compatible by adding the old constructor back and using the setColumnFamily() in there?
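The backward-compatible overload requested in this comment could look roughly like the sketch below. It assumes a Map in place of the Hadoop Configuration and uses made-up key names; CompatConfigSketch is hypothetical and only mirrors the method signature quoted in the comment.

```java
import java.util.Map;

// Hypothetical stand-in for ConfigHelper; a Map replaces Hadoop's Configuration.
final class CompatConfigSketch {
    static final String KEYSPACE_KEY = "sketch.output.keyspace";          // made-up key
    static final String COLUMN_FAMILY_KEY = "sketch.output.columnfamily"; // made-up key

    static void setOutputKeyspace(Map<String, String> conf, String keyspace) {
        conf.put(KEYSPACE_KEY, keyspace);
    }

    static void setOutputColumnFamily(Map<String, String> conf, String columnFamily) {
        conf.put(COLUMN_FAMILY_KEY, columnFamily);
    }

    // Old three-argument form kept for existing job scripts; it only delegates
    // to the two newer setters.
    static void setOutputColumnFamily(Map<String, String> conf, String keyspace, String columnFamily) {
        setOutputKeyspace(conf, keyspace);
        setOutputColumnFamily(conf, columnFamily);
    }
}
```

Existing callers of the three-argument form keep working unchanged, while new code can set the keyspace once and leave the column family to MultipleOutputs.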


                  

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481501#comment-13481501 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

[~mkjellman] - I think the BOF support should be in a separate issue, since CFOF and BOF don't depend on each other for the MultipleOutputs functionality--and because this issue specifically addresses CFOF.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0 beta 2
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265989#comment-13265989 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

Looking a bit closer at the MultipleOutputs class, it seems pretty tied to FileOutputFormat. So if we go this route we're probably looking at a separate CassandraMultipleOutputs with little re-use from MultipleOutputs. We could re-use the config keys, but we'd have to duplicate the strings since they're private. Am I missing something that makes this more straightforward?
                

[jira] [Issue Comment Edited] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272313#comment-13272313 ] 

Robbie Strickland edited comment on CASSANDRA-4208 at 5/10/12 1:21 PM:
-----------------------------------------------------------------------

I've attached a patch that adds a  setOutputColumnFamily() overload that takes in both keyspace and CF.  The one outstanding issue that I've commented on in CFOF is that checkOutputSpecs() cannot currently ensure that a CF has been specified either through setOutputColumnFamily() or MultipleOutputs.  

Unfortunately MultipleOutputs.getNamedOutputsList(), which would be the right way to do this, is currently private.  So we either don't do the check and let it throw an NPE at runtime, or we duplicate the code in MultipleOutputs to grab the values from config ourselves.  Not sure which is the lesser of two evils. 
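The second option, reading the values from config ourselves, might be sketched like this. Both key names are invented for illustration (the real MultipleOutputs key is private and is not reproduced here), a Map stands in for the Hadoop Configuration, and OutputSpecCheckSketch is hypothetical.

```java
import java.util.Map;

// Hypothetical validation helper; not the actual checkOutputSpecs() in CFOF.
final class OutputSpecCheckSketch {
    static final String COLUMN_FAMILY_KEY = "sketch.output.columnfamily"; // made-up key
    static final String NAMED_OUTPUTS_KEY = "sketch.named.outputs";       // made-up key

    // Fails early unless a column family was set directly or at least one
    // named output was registered.
    static void checkOutputSpecs(Map<String, String> conf) {
        String columnFamily = conf.get(COLUMN_FAMILY_KEY);
        String namedOutputs = conf.get(NAMED_OUTPUTS_KEY);
        boolean hasColumnFamily = columnFamily != null && !columnFamily.trim().isEmpty();
        boolean hasNamedOutputs = namedOutputs != null && !namedOutputs.trim().isEmpty();
        if (!hasColumnFamily && !hasNamedOutputs) {
            throw new IllegalStateException(
                    "No output column family: set one directly or register a named output");
        }
    }
}
```

The trade-off is the one described above: the job fails at submission with a clear message rather than with an NPE at runtime, at the cost of depending on a key that the other library treats as private.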
                
      was (Author: rstrickland):
    I've attached a patch that adds a  setOutputColumnFamily() overload that takes in both keyspace and CF.  The one outstanding issue that I've commented on in CFOF is that checkOutputSpecs() cannot currently ensure that a CF has been specified either through setOutputColumnFamily() or MultipleOutputs.  

Unfortunately MultipleOutputs.getNamedOutputsList()--which would be the right way to do this--is currently private.  So we either don't do the check and let it throw an NPE at runtime, or we duplicate the code in MultipleOutputs to grab the values from config ourselves.  Not sure which is the lesser of two evils. 
                  
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robbie Strickland updated CASSANDRA-4208:
-----------------------------------------

    Attachment: cassandra-1.1-4208.txt

It appears I was mistaken about the MultipleOutputs issue being resolved only in trunk.  It's resolved in the mapred package in trunk, but the new version in mapreduce dates at least back to 1.0.1.  It still references FileOutputFormat, but the attached patch gets around this by using the same config key.  I have attached a new patch based against Cassandra 1.1 and Hadoop 1.0.2.  Changes are actually minimal.  Let me know your thoughts...
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266032#comment-13266032 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

@Jake: MultipleOutputs is the class we've been referring to in the above posts, and it was around pre-1.0. Did you mean to refer to something else?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Comment Edited] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460770#comment-13460770 ] 

Michael Kjellman edited comment on CASSANDRA-4208 at 9/22/12 6:45 AM:
----------------------------------------------------------------------

This affects both ColumnFamilyOutputFormat and BulkOutputFormat: addNamedOutput never seems to set the column family.

I would assume:

ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);

is all that is needed. If I don't set up the job with job.setOutputFormatClass(ColumnFamilyOutputFormat.class), FileOutputFormat throws an exception:

Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)

If I do specify that at the job level, the job name never seems to set the column family name on that job.

Additionally, using the job name as the column family name is slightly inconvenient, as we use '_' in our column family names, which is not a valid character in MultipleOutputs; it looks like _# is the way they internally keep track of counters if that is enabled.

I would love to see the patch you are proposing to fix the issue for BulkOutputFormat :)
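One possible workaround for the underscore restriction, sketched under stated assumptions: register an alphanumeric alias as the named output and keep a separate lookup from alias to the real column family name. The rule used below (letters and digits only) is an assumption about what MultipleOutputs accepts and should be verified against the Hadoop version in use. The class and helper names are hypothetical, and none of this is part of the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NamedOutputAlias {
    // Assumed rule: a named output may contain only ASCII letters and digits.
    static boolean isValidNamedOutput(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (char ch : name.toCharArray()) {
            boolean ok = (ch >= 'A' && ch <= 'Z')
                      || (ch >= 'a' && ch <= 'z')
                      || (ch >= '0' && ch <= '9');
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    // Drops every character the assumed rule rejects, e.g. '_'.
    static String toAlias(String columnFamily) {
        StringBuilder sb = new StringBuilder();
        for (char ch : columnFamily.toCharArray()) {
            if (isValidNamedOutput(String.valueOf(ch))) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Builds alias -> column family, refusing empty or colliding aliases.
    static Map<String, String> buildAliases(String... columnFamilies) {
        Map<String, String> aliases = new LinkedHashMap<>();
        for (String cf : columnFamilies) {
            String alias = toAlias(cf);
            if (alias.isEmpty() || aliases.containsKey(alias)) {
                throw new IllegalArgumentException("Unusable alias for " + cf);
            }
            aliases.put(alias, cf);
        }
        return aliases;
    }

    public static void main(String[] args) {
        System.out.println(buildAliases("user_index", "user_data"));
    }
}
```

The job would then register each alias as the named output and translate back to the real column family name when writing. Two names that differ only in stripped characters (for example "user_index" and "userindex") collide, which is why buildAliases rejects duplicates.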
                
      was (Author: mkjellman):
    Both ColumnFamilyOutputFormat and BulkOutputFormat. addNamedOutput never seems to set the column family.

I would assume:

ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2, ColumnFamilyOutputFormat.class, ByteBuffer.class, List.class);

is all that is needed. If i don't setup the job with job.SetOutputFormatClass(ColumnFamilyOutputFormat.class) FileOutputFormat throws an exception

Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:127)

If i do specify that at the job level the job name never seems to to set the column family name on that job.

additionally, using the job name as the column family name is slightly inconvenient as we use '_' in our column family names which is not a valid character in MultipleOutputs as it looks like _# is the way they internally keep track of counters if that is enabled. 
                  
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272370#comment-13272370 ] 

T Jake Luciani commented on CASSANDRA-4208:
-------------------------------------------

Well, there is always http://tutorials.jenkov.com/java-reflection/private-fields-and-methods.html#methods

We use something like this in FBUtilities for accessing protected fields.  

I don't know how much of a worry an NPE should be; you could just add a log message if the column family isn't set, so people can see it before the NPE and realize they did something wrong.
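For reference, the reflection approach can be sketched as below. The nested Library class is a hypothetical stand-in for a class with a private static helper (it is not MultipleOutputs itself), and the method name hiddenNames is invented for illustration. Only the standard java.lang.reflect calls are real.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

class PrivateAccessSketch {
    // Hypothetical stand-in for a library class that hides a useful helper.
    static class Library {
        private static List<String> hiddenNames(String raw) {
            return Arrays.asList(raw.trim().split("\\s+"));
        }
    }

    // Looks up the private method by name and invokes it reflectively.
    @SuppressWarnings("unchecked")
    static List<String> callHidden(String raw) {
        try {
            Method m = Library.class.getDeclaredMethod("hiddenNames", String.class);
            m.setAccessible(true); // bypass the private modifier
            return (List<String>) m.invoke(null, raw); // null target: static method
        } catch (ReflectiveOperationException e) {
            // A renamed or removed method only shows up here, at runtime.
            throw new IllegalStateException("Could not call hiddenNames reflectively", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callHidden("cf1 cf2"));
    }
}
```

The trade-off is that the compiler no longer checks the method name or signature, so an upstream rename surfaces only as a runtime failure.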
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Comment Edited] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458247#comment-13458247 ] 

Michael Kjellman edited comment on CASSANDRA-4208 at 9/19/12 9:36 AM:
----------------------------------------------------------------------

Yes, we have it working as well (and thanks for the patch; it's a really important feature to us). But so far we have been unsuccessful in getting it to work with BulkOutputFormat. I'm going to work on debugging that today.
                
      was (Author: mkjellman):
    yes- we have it working as well. but so far we have been unsuccessful in getting it to work with bulkoutputformat...i'm going to work on debugging that today
                  
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460807#comment-13460807 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

You don't need the Hadoop patch to make this work.  I think I'm confused as to whether you're having trouble getting this to work at all, or just with BOF.  As I mentioned I have not tested this with BOF, but it is working against 1.1.x & Hadoop 1.0.2 using CFOF.  Look here for an example that works with CFOF: https://gist.github.com/3763728.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266082#comment-13266082 ] 

T Jake Luciani commented on CASSANDRA-4208:
-------------------------------------------

My bad. I didn't notice the linked issue.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481523#comment-13481523 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

Robbie - I'm okay with that, but I'm not sure we should apply the BOF patch you provided if it doesn't work. I'm still debugging exactly why it doesn't stream, but getting an environment set up to debug the whole process has been difficult.

If anything, maybe we should revert the change to BOF, keep the other changes, and then open another BOF bug for multiple output support?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0 beta 2
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460800#comment-13460800 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

I applied the patch to Hadoop 1.0.3 as well. Are you suggesting then that for now this patch assumes those methods are still private?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Comment Edited] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479381#comment-13479381 ] 

Michael Kjellman edited comment on CASSANDRA-4208 at 10/18/12 9:42 PM:
-----------------------------------------------------------------------

Jake or Robbie -- have you tested this with BOF? I've confirmed that it looks like this only streams one of the two named multiple outputs. The sstables are created for both column families but the reducer never streams the data to the nodes.
                
      was (Author: mkjellman):
    Jake or Robbie -- have you tested this with BOF? I've confirmed that it looks like this only streams one of the two named multiple outputs.
                  
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0 beta 2
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265914#comment-13265914 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

I should note it would be easy to make this work with previous releases if desired.  I think that was your real question... :)
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458247#comment-13458247 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

Yes, we have it working as well. But so far we have been unsuccessful in getting it to work with BulkOutputFormat. I'm going to work on debugging that today.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robbie Strickland updated CASSANDRA-4208:
-----------------------------------------

    Comment: was deleted

(was: I created an issue regarding the specificity of MultipleOutputs to FileOutputFormat. Linked here as an FYI.)
    
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robbie Strickland updated CASSANDRA-4208:
-----------------------------------------

    Attachment: trunk-4208-v2.txt

I've added a patch to allow support for MultipleOutputs. Hadoop trunk now contains a new version of MultipleOutputs that should support this out of the box, although I am submitting a patch to deal with an inconsistency that could cause future issues with non-file formats.

The basic solution involves changing the config key for output CF to match the "basename" key being written by MultipleOutputs. I had to make related changes to CassandraStorage and TestRingCache, as well as some minor changes to ColumnFamilyInputFormat to account for some interface changes in Hadoop trunk.

So the bottom line is this will work if people use Hadoop and Cassandra trunk with both patches applied. The original patch can be used as a temporary solution if needed.
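The shared-key idea described above can be sketched as follows. A plain Map stands in for the Hadoop Configuration so the example runs on its own. The key string "mapreduce.output.basename" is an assumption about the key MultipleOutputs writes and should be checked against the Hadoop version in use; the class and method names are hypothetical and do not mirror the patch's actual code.

```java
import java.util.HashMap;
import java.util.Map;

class SharedKeySketch {
    // Assumed "basename" key that MultipleOutputs sets for each named output.
    static final String BASE_NAME_KEY = "mapreduce.output.basename";

    // Single-CF jobs set the column family up front under the same key.
    static void setOutputColumnFamily(Map<String, String> conf, String cf) {
        conf.put(BASE_NAME_KEY, cf);
    }

    // Models what MultipleOutputs is assumed to do per named output:
    // copy the job config and overwrite the base name with the output's name.
    static Map<String, String> confForNamedOutput(Map<String, String> jobConf,
                                                  String namedOutput) {
        Map<String, String> copy = new HashMap<>(jobConf);
        copy.put(BASE_NAME_KEY, namedOutput);
        return copy;
    }

    // The record writer reads one key and works in both setups.
    static String getOutputColumnFamily(Map<String, String> conf) {
        return conf.get(BASE_NAME_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        setOutputColumnFamily(jobConf, "defaultcf");
        System.out.println(getOutputColumnFamily(jobConf));
        System.out.println(getOutputColumnFamily(confForNamedOutput(jobConf, "indexcf")));
    }
}
```

Because the writer reads the same key in both cases, it does not need to know whether the column family came from the job setup or from a named output.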
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.

[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266018#comment-13266018 ] 

T Jake Luciani commented on CASSANDRA-4208:
-------------------------------------------

Could this be accomplished using http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html?

It was recently added in Hadoop 1.0.2.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robbie Strickland updated CASSANDRA-4208:
-----------------------------------------

    Attachment: cassandra-1.1-4208-v4.txt

I've attached a new patch that removes the check for a null output CF on BulkOutputFormat.  This allows BOF to use the MultipleOutputs API.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276874#comment-13276874 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

Any word on this?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Comment Edited] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460763#comment-13460763 ] 

Robbie Strickland edited comment on CASSANDRA-4208 at 9/22/12 6:42 AM:
-----------------------------------------------------------------------

You mean BulkOutputFormat isn't working, or MO isn't working at all?  BulkOutputFormat isn't working because it still checks that the output CF has been set and throws an exception otherwise.  I'm happy to remove this check, but we don't use BOF, so I don't have the bandwidth to test it.  I'll create the patch if you're willing to test it.
                
      was (Author: rstrickland):
    You mean BulkOutputFormat isn't working, or MO isn't working at all?
                  
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Resolved] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-4208.
---------------------------------------

    Resolution: Fixed
    
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460745#comment-13460745 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

So I've been working on this for a few days. As far as I can tell, this is not working with Cassandra 1.1.5 and Hadoop 1.0.3. I've gone through svn blame, and it doesn't look like anything exciting has really changed in the mapreduce code. Robbie, have you tested this on the current GA versions?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479381#comment-13479381 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

Jake or Robbie -- have you tested this with BOF? As far as I can confirm, this only streams one of the two named outputs.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0 beta 2
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robbie Strickland updated CASSANDRA-4208:
-----------------------------------------

    Attachment: trunk-4208.txt
    
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457343#comment-13457343 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

The attached patch works and we have it running in production.  I'm not sure why I haven't received any response since May on whether this will be included in some future release.  I presume everyone is busy on other features.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411459#comment-13411459 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

I'd like to know if this is going to be included or if another direction is preferred.  Any update?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267498#comment-13267498 ] 

Jonathan Ellis commented on CASSANDRA-4208:
-------------------------------------------

bq. I am submitting a patch to deal with an inconsistency that could cause future issues with non-file formats

On MAPREDUCE-4216 or elsewhere?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460798#comment-13460798 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

I had already done what your patch contains. Only one SSTable gets created. Have you tested that patch? Am I missing something obvious with the job config requirements?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Reopened] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reopened CASSANDRA-4208:
---------------------------------------

    
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0 beta 2
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488417#comment-13488417 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

Another question: why are we targeting Hadoop 1.0.2 instead of 1.0.3 in build.xml?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robbie Strickland updated CASSANDRA-4208:
-----------------------------------------

    Attachment: cassandra-1.1-4208-v2.txt

I've attached a patch that adds a setOutputColumnFamily() overload that takes both keyspace and CF.  The one outstanding issue, which I've commented on in CFOF, is that checkOutputSpecs() cannot currently ensure that a CF has been specified either through setOutputColumnFamily() or MultipleOutputs.

Unfortunately MultipleOutputs.getNamedOutputsList()--which would be the right way to do this--is currently private.  So we either don't do the check and let it throw an NPE at runtime, or we duplicate the code in MultipleOutputs to grab the values from config ourselves.  Not sure which is the lesser of two evils. 
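
To make the second option concrete, here is a minimal plain-Java sketch of a check that reads the named outputs from config itself. Everything here is an assumption made for illustration: a Map stands in for the Hadoop Configuration, the class name is hypothetical, and the key names "mapreduce.output.basename" and "mapreduce.multipleoutputs" (with a space-separated list of names) reflect my understanding of what MultipleOutputs stores, not something confirmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch; a Map stands in for a Hadoop job Configuration.
final class OutputSpecCheckSketch {
    // Assumed key names, not confirmed against any particular Hadoop release.
    static final String OUTPUT_COLUMNFAMILY_CONFIG = "mapreduce.output.basename";
    static final String MULTIPLE_OUTPUTS = "mapreduce.multipleoutputs";

    // Duplicates what the private getNamedOutputsList() is assumed to do:
    // split a space-separated list of named outputs.
    static List<String> namedOutputs(Map<String, String> conf) {
        List<String> names = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(conf.getOrDefault(MULTIPLE_OUTPUTS, ""), " ");
        while (tokens.hasMoreTokens()) {
            names.add(tokens.nextToken());
        }
        return names;
    }

    // Fails fast at job submission instead of hitting an NPE at runtime.
    static void checkOutputSpecs(Map<String, String> conf) {
        boolean hasSingleCf = conf.get(OUTPUT_COLUMNFAMILY_CONFIG) != null;
        if (!hasSingleCf && namedOutputs(conf).isEmpty()) {
            throw new UnsupportedOperationException(
                "Set an output column family or register at least one named output");
        }
    }
}
```

The trade-off is the one noted above: the check works, but it depends on a key and encoding that MultipleOutputs treats as private, so it could silently break if Hadoop changes them.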
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271747#comment-13271747 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

Any word on whether this solution is getting the thumbs up?  I personally need this functionality and would like to proceed in a manner that will ultimately be accepted by the community.
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265963#comment-13265963 ] 

Robbie Strickland commented on CASSANDRA-4208:
----------------------------------------------

We could use MultipleOutputs if you think that's better, though the implementation is certainly less trivial than what I've done here. Upside is of course sticking with the convention. I'm not really sure it gets us any more than that, and personally I think it adds unnecessary complexity to an already convoluted API. Passing in a CF at the call level is more intuitive and will be more familiar to Cassandra users, IMHO. But I'm happy to work on the MultipleOutputs version if that's the consensus.
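
For comparison, the call-level approach mentioned above can be sketched in a few lines of plain Java. This is only an illustration of the idea of passing the CF with each write; the class, its methods, and the in-memory buffering are hypothetical and are not the ColumnFamilyRecordWriter code from the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer that takes the column family on every write() call.
final class PerCallCfWriterSketch {
    // Column family name -> rows buffered for it, as "key=value" strings.
    private final Map<String, List<String>> pending = new LinkedHashMap<>();

    void write(String columnFamily, String rowKey, String value) {
        if (columnFamily == null) {
            throw new IllegalArgumentException("column family must be given per write");
        }
        pending.computeIfAbsent(columnFamily, cf -> new ArrayList<>()).add(rowKey + "=" + value);
    }

    // Rows buffered so far for one column family (empty if none).
    List<String> rowsFor(String columnFamily) {
        return pending.getOrDefault(columnFamily, Collections.emptyList());
    }
}
```

A reducer using something like this could write the value and its index entry in one pass, for example write("Users", id, email) followed by write("UsersByEmail", email, id), without any job-level CF setting.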
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457310#comment-13457310 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

Any additional updates on this? Robbie, what direction did you decide to pursue?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267507#comment-13267507 ] 

T Jake Luciani commented on CASSANDRA-4208:
-------------------------------------------

@Robbie, is the version in Hadoop trunk different from the version included in MAPREDUCE-3607?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266690#comment-13266690 ] 

T Jake Luciani commented on CASSANDRA-4208:
-------------------------------------------

@Robbie, can you post your code analysis on the Hadoop ticket?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Updated] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Robbie Strickland (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robbie Strickland updated CASSANDRA-4208:
-----------------------------------------

    Attachment: cassandra-1.1-4208-v3.txt

Here's a new patch that handles the potential NPE on getOutputColumnFamily() and throws a more descriptive exception. 
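
The kind of guard described here might look like the minimal sketch below. It is an assumption-laden illustration, not the contents of cassandra-1.1-4208-v3.txt: a plain Map stands in for the Hadoop job configuration, and the class name, the key string, the exception type, and the message wording are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a config helper; NOT the Cassandra class.
class OutputConfigSketch {
    // Assumed key name, used for illustration only.
    static final String OUTPUT_CF_KEY = "cassandra.output.columnfamily";

    // Fails fast with a descriptive message instead of returning null
    // and letting a NullPointerException surface later in the job.
    static String getOutputColumnFamily(Map<String, String> conf) {
        String columnFamily = conf.get(OUTPUT_CF_KEY);
        if (columnFamily == null) {
            throw new IllegalStateException(
                    "No output column family configured: set "
                    + OUTPUT_CF_KEY + " before writing");
        }
        return columnFamily;
    }
}
```

The design point is only that the failure names the missing setting at the place where it is first read, so the user does not have to trace a bare NPE back to the configuration.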
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271761#comment-13271761 ] 

T Jake Luciani commented on CASSANDRA-4208:
-------------------------------------------

I'm OK with this now that it works with MultipleOutputs (nice find), though I'm not sure whether it should go into 1.1, since it would break existing scripts.  Would you be able to make it backwards compatible by adding the old constructor back and calling setColumnFamily() in there?


                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, trunk-4208-v2.txt, trunk-4208.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "Michael Kjellman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488000#comment-13488000 ] 

Michael Kjellman commented on CASSANDRA-4208:
---------------------------------------------

So are we going to revert commit e05a5fc12648f315002c9939a2a0748d74525589 and recommit minus the changes in the patch for BOF?
                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>            Assignee: Robbie Strickland
>             Fix For: 1.2.0
>
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt, trunk-4208-v3.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.


[jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462791#comment-13462791 ] 

T Jake Luciani commented on CASSANDRA-4208:
-------------------------------------------

Hi Robbie, I'm ready to commit this, but the issue is that we don't want to change Hadoop versions on the stable 1.1 branch.

Could you rebase your patch for trunk?  1.2 should be out soon.


                
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4208
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4208
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt, cassandra-1.1-4208-v4.txt, trunk-4208.txt, trunk-4208-v2.txt
>
>
> It is not currently possible to output records to more than one column family in a single reducer.  Considering that writing values to Cassandra often involves multiple column families (i.e. updating your index when you insert a new value), this seems overly restrictive.  I am submitting a patch that moves the specification of column family from the job configuration to the write() call in ColumnFamilyRecordWriter.
