You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Yan Fang <ya...@gmail.com> on 2014/04/30 02:33:53 UTC

Review Request 20869: SAMZA-138

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20869/
-----------------------------------------------------------

Review request for samza.


Repository: samza-hello-samza


Description
-------

a filereader system which reads specified file contents onto stream.

1. FileReaderSystemAdmin implements SystemAdmin and is similar to SinglePartitionWithoutOffsetsSystemAdmin except it updates offset.
2. FileReaderSystemFactory implements SystemFactory. It throws a Samza exception for getProducer because this system does not suppose to write to files.
3. FileReaderSystemConsumer gets file path from stream names, creates threads for each file and then read files line-by-line. It skips certain lines when its startingOffset is not zero, meaning it has checkpoint.


Diffs
-----

  samza-wikipedia/src/main/java/samza/examples/file/system/FileReaderAdmin.java PRE-CREATION 
  samza-wikipedia/src/main/java/samza/examples/file/system/FileReaderConsumer.java PRE-CREATION 
  samza-wikipedia/src/main/java/samza/examples/file/system/FileReaderSystemFactory.java PRE-CREATION 
  samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderAdmin.java PRE-CREATION 
  samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderConsumer.java PRE-CREATION 
  samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemFactory.java PRE-CREATION 

Diff: https://reviews.apache.org/r/20869/diff/


Testing
-------


Thanks,

Yan Fang


Re: Review Request 20869: SAMZA-138

Posted by Yan Fang <ya...@gmail.com>.

> On April 30, 2014, 6:27 p.m., Chris Riccomini wrote:
> > 1. Stylistic: white space should be spaces, not tabs.
> > 2. Can we put this in the main code base, rather than in hello-samza? Hello-samza can still use it, but this seems like a really useful consumer. I think samza-core is appropriate for it.
> > 3. If we do (2), would you feel comfortable converting this to Scala? Happy to provide Scala guidance, if you need it.
> > 4. Need tests.
> > 5. I see that your implementation right now treats an offset as a line (e.g. 0 means the first line, 1 means the second, etc). This makes things a lot easier to implement internally, but means we have to read the entire file every time we start up. If we use the actual file offset for the \n location, we can do an fseek, which is much faster.

3. If we want to put them in core base, I can convert this to Scala. It may take a little effort, but it's fine to me.
4. Will add tests.
5. agree. Will update this.


> On April 30, 2014, 6:27 p.m., Chris Riccomini wrote:
> > samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java, line 40
> > <https://reviews.apache.org/r/20869/diff/2/?file=571202#file571202line40>
> >
> >     This is not quite correct. Rather than use a static FilePartitionMetadata and hard code made-up offsets, you'll need to uniquely determine the oldest, newest, and upcoming offsets for every stream (file).
> >     
> >     The oldest offset will always be 0.
> >     
> >     The newest offset will always be the offset of second-to-last \n in the file + 1. The upcoming offset will be the offset of the last \n in the file + 1.
> >     
> >     msg1\n <-- oldest offset
> >     msg2\n <-- newest offset
> >     msg3\n <-- upcoming offset
> >     not-yet-complete-msg-that-has-no-newline-after-it-yet

oldest offset makes sense to me.
In order to get the newest offset, do I have to create a kafka consumer to get current last \n? Otherwise, it seems no way I can get the file pos. (Read the file in this class?)


- Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20869/#review41856
-----------------------------------------------------------


On April 30, 2014, 12:42 a.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20869/
> -----------------------------------------------------------
> 
> (Updated April 30, 2014, 12:42 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza-hello-samza
> 
> 
> Description
> -------
> 
> a filereader system which reads specified file contents onto stream.
> 
> 1. FileReaderSystemAdmin implements SystemAdmin and is similar to SinglePartitionWithoutOffsetsSystemAdmin except it updates offset.
> 2. FileReaderSystemFactory implements SystemFactory. It throws a Samza exception for getProducer because this system does not suppose to write to files.
> 3. FileReaderSystemConsumer gets file path from stream names, creates threads for each file and then read files line-by-line. It skips certain lines when its startingOffset is not zero, meaning it has checkpoint.
> 
> 
> Diffs
> -----
> 
>   samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java PRE-CREATION 
>   samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java PRE-CREATION 
>   samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemFactory.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/20869/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>


Re: Review Request 20869: SAMZA-138

Posted by Chris Riccomini <cr...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20869/#review41856
-----------------------------------------------------------


1. Stylistic: white space should be spaces, not tabs.
2. Can we put this in the main code base, rather than in hello-samza? Hello-samza can still use it, but this seems like a really useful consumer. I think samza-core is appropriate for it.
3. If we do (2), would you feel comfortable converting this to Scala? Happy to provide Scala guidance, if you need it.
4. Need tests.
5. I see that your implementation right now treats an offset as a line (e.g. 0 means the first line, 1 means the second, etc). This makes things a lot easier to implement internally, but means we have to read the entire file every time we start up. If we use the actual file offset for the \n location, we can do an fseek, which is much faster.


samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java
<https://reviews.apache.org/r/20869/#comment75498>

    Javadocs here would be useful.



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java
<https://reviews.apache.org/r/20869/#comment75497>

    Some java docs describing behavior would be good here.



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java
<https://reviews.apache.org/r/20869/#comment75496>

    This is not quite correct. Rather than use a static FilePartitionMetadata and hard code made-up offsets, you'll need to uniquely determine the oldest, newest, and upcoming offsets for every stream (file).
    
    The oldest offset will always be 0.
    
    The newest offset will always be the offset of second-to-last \n in the file + 1. The upcoming offset will be the offset of the last \n in the file + 1.
    
    msg1\n <-- oldest offset
    msg2\n <-- newest offset
    msg3\n <-- upcoming offset
    not-yet-complete-msg-that-has-no-newline-after-it-yet



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java
<https://reviews.apache.org/r/20869/#comment75499>

    Javadocs here would be useful.



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java
<https://reviews.apache.org/r/20869/#comment75500>

    You can't add 1 here. You have to find the location of the next newline in the file after the supplied offset.



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java
<https://reviews.apache.org/r/20869/#comment75501>

    Stylistic: keep the ASF header 80 char aligned. Easiest way is to just copy from an existing file:
    
    /*
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *   http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing,
     * software distributed under the License is distributed on an
     * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     * KIND, either express or implied.  See the License for the
     * specific language governing permissions and limitations
     * under the License.
     */



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java
<https://reviews.apache.org/r/20869/#comment75502>

    Javadocs.



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java
<https://reviews.apache.org/r/20869/#comment75507>

    Stylistic: camel case in variable names.



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java
<https://reviews.apache.org/r/20869/#comment75508>

    Do you need metricsRegistry? If so, should be Registry, not REgistry.



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java
<https://reviews.apache.org/r/20869/#comment75510>

    Keep track of these threads so we can shut them down in stop().



samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java
<https://reviews.apache.org/r/20869/#comment75509>

    Stop all reader threads that were created in start.


- Chris Riccomini


On April 30, 2014, 12:42 a.m., Yan Fang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20869/
> -----------------------------------------------------------
> 
> (Updated April 30, 2014, 12:42 a.m.)
> 
> 
> Review request for samza.
> 
> 
> Repository: samza-hello-samza
> 
> 
> Description
> -------
> 
> a filereader system which reads specified file contents onto stream.
> 
> 1. FileReaderSystemAdmin implements SystemAdmin and is similar to SinglePartitionWithoutOffsetsSystemAdmin except it updates offset.
> 2. FileReaderSystemFactory implements SystemFactory. It throws a Samza exception for getProducer because this system does not suppose to write to files.
> 3. FileReaderSystemConsumer gets file path from stream names, creates threads for each file and then read files line-by-line. It skips certain lines when its startingOffset is not zero, meaning it has checkpoint.
> 
> 
> Diffs
> -----
> 
>   samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java PRE-CREATION 
>   samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java PRE-CREATION 
>   samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemFactory.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/20869/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Yan Fang
> 
>


Re: Review Request 20869: SAMZA-138

Posted by Yan Fang <ya...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20869/
-----------------------------------------------------------

(Updated April 30, 2014, 12:42 a.m.)


Review request for samza.


Changes
-------

Accidently uploaded the code with old class name. Update to the new name.


Repository: samza-hello-samza


Description
-------

a filereader system which reads specified file contents onto stream.

1. FileReaderSystemAdmin implements SystemAdmin and is similar to SinglePartitionWithoutOffsetsSystemAdmin except it updates offset.
2. FileReaderSystemFactory implements SystemFactory. It throws a Samza exception for getProducer because this system does not suppose to write to files.
3. FileReaderSystemConsumer gets file path from stream names, creates threads for each file and then read files line-by-line. It skips certain lines when its startingOffset is not zero, meaning it has checkpoint.


Diffs (updated)
-----

  samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemAdmin.java PRE-CREATION 
  samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemConsumer.java PRE-CREATION 
  samza-wikipedia/src/main/java/samza/examples/filereader/system/FileReaderSystemFactory.java PRE-CREATION 

Diff: https://reviews.apache.org/r/20869/diff/


Testing
-------


Thanks,

Yan Fang