Posted to user@cassandra.apache.org by Shawna Qian <sh...@yahoo-inc.com> on 2012/05/10 15:13:01 UTC

SSTableWriter to hdfs

Hi

Can I use SSTableSimpleUnsortedWriter to write the data directly to HDFS, or do I have to use hdfs copyFromLocal to copy the sstable files from local disk to HDFS after they get generated?

Thx
Shawna

Sent from my iPhone

On May 7, 2012, at 3:48 AM, "aaron morton" <aa...@thelastpickle.com> wrote:

Can you copy the sstables as a task after the load operation? You should know where the files are.

Multiple files may be created by the writer during the loading process, so running code that performs a long-running action will impact the time taken to pump data through the SSTableSimpleUnsortedWriter.
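For example, something along these lines could run once the writer is closed (a rough sketch using the Hadoop FileSystem API; the class name, local output directory and HDFS destination are placeholders, not anything the writer gives you):

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SSTableHdfsCopy
{
    // Copy every file the writer left in localDir into a directory on HDFS.
    public static void copySSTables(File localDir, String hdfsDest) throws IOException
    {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dest = new Path(hdfsDest);
        fs.mkdirs(dest);

        File[] files = localDir.listFiles();
        if (files == null)
            return;

        for (File f : files)
        {
            // copyFromLocalFile(delSrc, src, dst): false keeps the local copy
            fs.copyFromLocalFile(false, new Path(f.getAbsolutePath()), dest);
        }
    }
}

e.g. copySSTables(new File("/tmp/sstables"), "/user/cassandra/sstables/Standard1");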

wrt the patch, the best place to start the conversation for this is https://issues.apache.org/jira/browse/CASSANDRA

Thanks for taking the time to look into this.

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/05/2012, at 11:40 PM, Benoit Perroud wrote:

Hi All,

I'm bulk loading (a lot of) data from Hadoop into Cassandra 1.0.x. The
provided CFOutputFormat is not the best fit here, so I wanted to use the
bulk loading feature. I know 1.1 comes with a BulkOutputFormat, but I
wanted to propose a simple enhancement to SSTableSimpleUnsortedWriter
that could make life easier:

When a table is flushed to disk, it would be useful to have listeners
that are triggered to perform an arbitrary action (copying the sstable
into HDFS, for instance).

Please have a look at the patch below, which should give a better idea.
Do you think it would be worthwhile to open a jira for this?


Regarding the 1.1 BulkOutputFormat and bulk loading in general, the work
done to have a light client stream into the cluster is really great. The
issue now is that data is streamed only at the end of the task. This
causes all the tasks to store their data locally and stream everything
at the end. A lot of temporary space may be needed, and a lot of
bandwidth to the nodes is used at the "same" time. With the listener,
we would be able to start streaming as soon as the first table is
created. That way the streaming bandwidth could be better balanced.
Should I open a jira for this as well?

Thanks

Benoit.




--- a/src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java
+++ b/src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java
@@ -21,6 +21,8 @@ package org.apache.cassandra.io.sstable;
 import java.io.File;
 import java.io.IOException;
 import java.nio.ByteBuffer;
+import java.util.LinkedList;
+import java.util.List;
 import java.util.Map;
 import java.util.TreeMap;

@@ -47,6 +49,8 @@ public class SSTableSimpleUnsortedWriter extends AbstractSSTableSimpleWriter
     private final long bufferSize;
     private long currentSize;

+    private final List<SSTableWriterListener> sSTableWrittenListeners = new LinkedList<SSTableWriterListener>();
+
     /**
      * Create a new buffering writer.
      * @param directory the directory where to write the sstables
@@ -123,5 +127,16 @@ public class SSTableSimpleUnsortedWriter extends AbstractSSTableSimpleWriter
         }
         currentSize = 0;
         keys.clear();
+
+        // Notify the registered listeners
+        for (SSTableWriterListener listener : sSTableWrittenListeners)
+        {
+            listener.onSSTableWrittenAndClosed(writer.getTableName(), writer.getColumnFamilyName(), writer.getFilename());
+        }
+    }
+
+    public void addSSTableWriterListener(SSTableWriterListener listener)
+    {
+        sSTableWrittenListeners.add(listener);
     }
 }
diff --git a/src/java/org/apache/cassandra/io/sstable/SSTableWriterListener.java b/src/java/org/apache/cassandra/io/sstable/SSTableWriterListener.java
new file mode 100644
index 0000000..6628d20
--- /dev/null
+++ b/src/java/org/apache/cassandra/io/sstable/SSTableWriterListener.java
@@ -0,0 +1,9 @@
+package org.apache.cassandra.io.sstable;
+
+import java.io.IOException;
+
+public interface SSTableWriterListener {
+
+    void onSSTableWrittenAndClosed(final String tableName, final String columnFamilyName, final String filename) throws IOException;
+
+}
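To make the intended usage concrete, here is a rough sketch of how a job could register such a listener (it assumes the patch above is applied; the output directory, HDFS destination and keyspace/column family names are made-up examples, and the constructor arguments follow the usual 1.0.x bulk loading examples):

import java.io.File;
import java.io.IOException;

import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.dht.RandomPartitioner;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
import org.apache.cassandra.io.sstable.SSTableWriterListener;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingSSTableExample
{
    public static void main(String[] args) throws IOException
    {
        // Local directory, partitioner, keyspace, column family,
        // comparator, subcomparator, buffer size in MB.
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                new File("/tmp/sstables"), new RandomPartitioner(),
                "Keyspace1", "Standard1", AsciiType.instance, null, 64);

        // The proposed hook: every time a buffered sstable is written and
        // closed, copy it to HDFS immediately instead of waiting for the
        // whole task to finish.
        writer.addSSTableWriterListener(new SSTableWriterListener()
        {
            public void onSSTableWrittenAndClosed(String tableName, String columnFamilyName, String filename) throws IOException
            {
                FileSystem fs = FileSystem.get(new Configuration());
                fs.copyFromLocalFile(false, new Path(filename),
                                     new Path("/user/cassandra/" + tableName + "/" + columnFamilyName));
            }
        });

        // ... newRow() / addColumn() calls as usual, then:
        writer.close();
    }
}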


Re: SSTableWriter to hdfs

Posted by aaron morton <aa...@thelastpickle.com>.
Jeremy, do you know the best approach here?

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/05/2012, at 1:13 AM, Shawna Qian wrote:

> Hi 
> 
> Can I use SSTableSimpleUnsortedWriter to write the data directly to HDFS, or do I have to use hdfs copyFromLocal to copy the sstable files from local disk to HDFS after they get generated?
> 
> Thx
> Shawna
> 
> Sent from my iPhone