You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flume.apache.org by "Hari Shreedharan (JIRA)" <ji...@apache.org> on 2013/09/13 20:31:53 UTC

[jira] [Commented] (FLUME-2190) add a source capable of feeding off of the Twitter Streaming API

    [ https://issues.apache.org/jira/browse/FLUME-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13766777#comment-13766777 ] 

Hari Shreedharan commented on FLUME-2190:
-----------------------------------------

Thanks Roman for the patch. Looks pretty interesting as a demo/example source. Could you please add a section in the Flume User Guide with details on how to use the source. Also, please mark this source as experimental (in the source as well as in the User Guide - use a ".. warning::" in the rst file to make it prominent). I looked at it, and have a few comments:

* There are a bunch of unused imports. Please clean them up.
* Please use the interface stability annotations to mark it as private and stable/unstable.
* There are several lines > 80 chars long. Please make sure lines are <=80 chars.

{code}
super.start();
{code}
This should happen at the end of the start method, since this will change the lifecycle status of the component to tell the Flume framework that this component has started. Doing this at the beginning of the start method tells the framework that the component started successfully, even if the method actually throws later.

{code}
docs = new ArrayList<Record>();
{code}
This can be made final.

{code}
    System.out.println(status.getUser().getName() + " : " + status.getText());
{code}
This should be removed or converted to a log statement with more details.

{code}
// TODO: increment rawBytes?
{code}
Do you want to do this or maybe remove the statement? ;)

{code}
private void addString(Record doc, String solr_field, String val) {
{code}
This should probably just be called "field" or "avroField" rather than solr_field.

* Is the onStatus method guaranteed to be called only from one thread? It does not seem like it is thread-safe, if it can get called from multiple threads.

* The flush based on time works only if onStatus gets called (as in if the method does not get called for hours together, the data in the buffer will not get flushed to the channel, until the method is called at least once - it is unlikely that this will happen, but is worth noting).

{code}
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(avroSchema);
    DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
    dataFileWriter.create(avroSchema, out);
    for (Record doc2 : docList) {
      dataFileWriter.append(doc2);
    }
    dataFileWriter.close();
{code}
Can't all these fields be reused (or is there a thread-safety aspect to it?)? You could simply use the flush method from BAOS to make sure the internal buffer is flushed to the byte array.

* Could you also annotate in the test that it runs only if the user has the relevant system properties set, and has twitter access. Maybe worth considering mocking the twitter classes.

* Another improvement, maybe for the future is to make the serialization pluggable, so the user can plugin the serialization rather than force Avro (like the HTTP Source and Spool Dir Source)?

Otherwise, looks good to go. 
                
> add a source capable of feeding off of the Twitter Streaming API
> ----------------------------------------------------------------
>
>                 Key: FLUME-2190
>                 URL: https://issues.apache.org/jira/browse/FLUME-2190
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>    Affects Versions: v1.4.0
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>             Fix For: v1.4.1
>
>         Attachments: 0001-FLUME-2190.-add-a-source-capable-of-feeding-off-of-t.patch
>
>
> I would like to propose adding a source capable of connecting via Streaming API to the 1% sample twitter firehose, continously downloading tweets, converting them to Avro format and sending Avro events to a downstream Flume sink.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira