Posted to dev@distributedlog.apache.org by Gerrit Sundaram <ge...@gmail.com> on 2016/11/02 09:21:25 UTC

FileSystem API over distributedlog logs

Hi distributedlog folks,

I am new to this community. I am wondering whether anyone has tried to
build a file system over replicated logs. There are a lot of similarities
between a filesystem file and a replicated log. You can use files to build
a replicated log, or use replicated logs to build a filesystem.

I took a look at the code repo and found two files,
'AppendOnlyStreamReader' and 'AppendOnlyStreamWriter'. They seem to
implement a file-I/O-related API. Did you guys attempt to provide a
filesystem API over distributedlog?

I am wondering if it is possible to build a filesystem over distributedlog.
Would this be an interesting topic for this project and the community? I
have two reasons for asking:
- I could leverage the good stuff, like parallel replication and low
latency, for better performance.
- DL uses ZooKeeper for metadata storage. ZooKeeper has a pretty nice
filesystem-like interface, so it would be a nice fit.

- Gerrit

Re: FileSystem API over distributedlog logs

Posted by Gerrit Sundaram <ge...@gmail.com>.
It would be great if you guys can push it.

- Gerrit

On Fri, Nov 11, 2016 at 12:21 PM, Leigh Stewart <
lstewart@twitter.com.invalid> wrote:

> Sure, we could do it. We skipped last time because DL was not OSS.
>
> Need to find some time though - let's discuss quickly next week.

Re: FileSystem API over distributedlog logs

Posted by Leigh Stewart <ls...@twitter.com.INVALID>.
Sure, we could do it. We skipped last time because DL was not OSS.

Need to find some time though - let's discuss quickly next week.

On Fri, Nov 11, 2016 at 12:10 PM, Sijie Guo <si...@apache.org> wrote:

> /cc Leigh
>
> I don't think we pushed the DL-related code to kestrel, as I think kestrel
> has been on the deprecation path internally at Twitter. But it might be
> worth pushing the code change just for reference. Leigh, what's your
> opinion?
>
> - Sijie

Re: FileSystem API over distributedlog logs

Posted by Sijie Guo <si...@apache.org>.
/cc Leigh

I don't think we pushed the DL-related code to kestrel, as I think kestrel
has been on the deprecation path internally at Twitter. But it might be
worth pushing the code change just for reference. Leigh, what's your
opinion?

- Sijie

On Wed, Nov 9, 2016 at 2:48 AM, Gerrit Sundaram <ge...@gmail.com>
wrote:

> Sijie, thank you for your comments and suggestions. I will start a
> separate thread for discussing the metadata operation primitives.
>
> BTW, I didn't find any code in kestrel that is related to distributedlog
> :( Can you kindly point me to the files?
>
> - Gerrit

Re: FileSystem API over distributedlog logs

Posted by Gerrit Sundaram <ge...@gmail.com>.
Sijie, thank you for your comments and suggestions. I will start a
separate thread for discussing the metadata operation primitives.

BTW, I didn't find any code in kestrel that is related to distributedlog :(
Can you kindly point me to the files?

- Gerrit

On Wed, Nov 2, 2016 at 10:35 AM, Sijie Guo <si...@twitter.com> wrote:

>
>
> On Wed, Nov 2, 2016 at 3:14 AM, Gerrit Sundaram <ge...@gmail.com>
> wrote:
>
>> FYI - I tried to use the AppendOnlyStreamWriter and
>> AppendOnlyStreamReader to demonstrate the idea :
>> https://github.com/apache/incubator-distributedlog/pulls/43 Let me know
>> if this is a good direction to go after.
>>
>> - Gerrit
>>
>> On Wed, Nov 2, 2016 at 2:21 AM, Gerrit Sundaram <gerritsundaram@gmail.com
>> > wrote:
>>
>>> Hi distributedlog folks,
>>>
>>> I am new to this community. I am wondering is there anyone tried to
>>> build a file system over replicated logs. There are a lot of similarities
>>> between a filesystem file and a replicated log. You can use files to build
>>> replicated log or use replicated logs to build a filesystem.
>>>
>>> I took at the code repo and found there are two files
>>> 'AppendOnlyStreamReader' and 'AppendOnlyStreamWriter'. They seem to
>>> implement file I/O related API. Did you guys attempt to provide filesystem
>>> API over distributedlog?
>>>
>>
> Ah, those two classes were designed for filesystem-like I/O operations. We
> used them for substituting the local-file-based journal in kestrel
> <https://github.com/twitter-archive/kestrel>.
>

>
>>
>>> I am wondering if it is possible to build a filesystem over
>>> distributedlog. Would this be an interesting topic to this project and the
>>> community? I have two reasons for that
>>> - I can leverage the good stuffs like parallel replication, low latency
>>> for better performance?
>>>
>> - DL uses zookeeper for metadata storage. ZooKeeper has pretty nice
>>> filesystem-like interface. So it would be a nice fit.
>>>
>>
> this sounds interesting. I don't think there are any major blockers for DL
> exposing a filesystem-like API, as indeed we already did that for kestrel.
> You might need to spend time on refining the metadata operations, like list
> files, get file status and such.
>
> Re "better performance" - for data I/O, it should be just fine for
> workloads like writes, tailing reads and caught-up reads (scans). I am not
> sure about random reads, as we didn't really pay attention to this at
> Twitter (although Salesforce used bookkeeper as the storage for also
> serving random reads, it should probably work just well).  I am not certain
> about metadata operations - we did create/open/delete log streams
> frequently for some of our use cases, but still might be less frequent
> comparing to a filesystem. We have a plan to make the stream primitive very
> lightweight, so we can support huge number of streams. We probably can work
> together on improving the metadata part.
>
> I took a look at your pull request. I liked your layout - putting it in a
> contrib module to incubate this idea. We definitely welcome any
> contributions that make DL easy to use. Feel free to start a proposal
> discussion
> <https://cwiki.apache.org/confluence/display/DL/Project+Proposals>. I
> believe there will be a lot of corner cases to discuss.
>

>
>
>>
>>> - Gerrit
>>>
>>>
>>>
>>>
>>>
>>
>

Re: FileSystem API over distributedlog logs

Posted by Sijie Guo <si...@twitter.com.INVALID>.
On Wed, Nov 2, 2016 at 3:14 AM, Gerrit Sundaram <ge...@gmail.com>
wrote:

> FYI - I tried to use the AppendOnlyStreamWriter and AppendOnlyStreamReader
> to demonstrate the idea:
> https://github.com/apache/incubator-distributedlog/pulls/43
> Let me know if this is a good direction to go after.
>
> - Gerrit
>
> On Wed, Nov 2, 2016 at 2:21 AM, Gerrit Sundaram <ge...@gmail.com>
> wrote:
>
>> Hi distributedlog folks,
>>
>> I am new to this community. I am wondering is there anyone tried to build
>> a file system over replicated logs. There are a lot of similarities between
>> a filesystem file and a replicated log. You can use files to build
>> replicated log or use replicated logs to build a filesystem.
>>
>> I took at the code repo and found there are two files
>> 'AppendOnlyStreamReader' and 'AppendOnlyStreamWriter'. They seem to
>> implement file I/O related API. Did you guys attempt to provide filesystem
>> API over distributedlog?
>>
>
Ah, those two classes were designed for filesystem-like I/O operations. We
used them to replace the local-file-based journal in kestrel
<https://github.com/twitter-archive/kestrel>.
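
As a rough illustration of what "filesystem-like I/O" over a log can mean,
here is a minimal in-memory sketch. It is not the AppendOnlyStreamWriter /
AppendOnlyStreamReader code and does not call any DistributedLog API; the
names AppendOnlyFileSketch, append, readAt, openReader and skipTo are all
hypothetical, and a byte buffer stands in for the replicated log. It only
shows the shape of the idea: writes always go to the tail, and a reader
seeks to a byte position and reads forward.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical in-memory stand-in for one log stream treated as an
// append-only file. None of these names are DistributedLog APIs.
class AppendOnlyFileSketch {
    private final ByteArrayOutputStream log = new ByteArrayOutputStream();

    // Append at the tail; returns the end position after the write.
    synchronized long append(byte[] data) {
        log.write(data, 0, data.length);
        return log.size();
    }

    // Current end position, i.e. the "file length".
    synchronized long length() {
        return log.size();
    }

    // Copy up to buf.length bytes starting at 'position'; -1 at end of log.
    synchronized int readAt(long position, byte[] buf) {
        if (position < 0) {
            throw new IllegalArgumentException("negative position");
        }
        byte[] all = log.toByteArray();
        if (position >= all.length) {
            return -1;
        }
        int n = (int) Math.min(buf.length, all.length - position);
        System.arraycopy(all, (int) position, buf, 0, n);
        return n;
    }

    // Sequential reader with an explicit seek, like file read()/seek().
    final class Reader {
        private long pos;

        // Returns false if the position is outside the current log.
        boolean skipTo(long position) {
            if (position < 0 || position > length()) {
                return false;
            }
            pos = position;
            return true;
        }

        // Reads forward from the current position and advances it.
        int read(byte[] buf) {
            int n = readAt(pos, buf);
            if (n > 0) {
                pos += n;
            }
            return n;
        }
    }

    Reader openReader() {
        return new Reader();
    }

    public static void main(String[] args) {
        AppendOnlyFileSketch file = new AppendOnlyFileSketch();
        file.append("hello ".getBytes(StandardCharsets.UTF_8));
        long end = file.append("world".getBytes(StandardCharsets.UTF_8));

        Reader reader = file.openReader();
        reader.skipTo(6);
        byte[] buf = new byte[16];
        int n = reader.read(buf);
        System.out.println(end + " " + new String(buf, 0, n, StandardCharsets.UTF_8));
    }
}
```

The sketch deliberately has no overwrite or truncate operation: an
append-only log gives you sequential writes and positional reads, which is
roughly what a journal file needs.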


>> I am wondering if it is possible to build a filesystem over
>> distributedlog. Would this be an interesting topic to this project and the
>> community? I have two reasons for that
>> - I can leverage the good stuffs like parallel replication, low latency
>> for better performance?
>> - DL uses zookeeper for metadata storage. ZooKeeper has pretty nice
>> filesystem-like interface. So it would be a nice fit.

This sounds interesting. I don't think there are any major blockers for DL
exposing a filesystem-like API, as indeed we already did that for kestrel.
You might need to spend time on refining the metadata operations, such as
listing files, getting file status and so on.
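
To make "metadata operations" concrete, here is a minimal sketch of the
kind of namespace layer a filesystem API would need on top of the logs. All
names (FileNamespaceSketch, FileStatus, create, recordAppend,
getFileStatus, listFiles) are hypothetical and are not part of DL; the
state is kept in an in-memory sorted map purely for illustration, whereas a
real design would have to decide where this metadata lives and how it stays
consistent with the logs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical metadata layer mapping file paths to per-file status.
// Not a DistributedLog API; state is in memory for illustration only.
class FileNamespaceSketch {
    // Immutable status record, loosely modelled on a "stat" result.
    static final class FileStatus {
        final String path;
        final long length;
        final long modifiedMillis;

        FileStatus(String path, long length, long modifiedMillis) {
            this.path = path;
            this.length = length;
            this.modifiedMillis = modifiedMillis;
        }
    }

    // Sorted by path so that a directory listing is a prefix scan.
    private final TreeMap<String, FileStatus> files = new TreeMap<>();

    // Returns false if the path already exists.
    synchronized boolean create(String path, long nowMillis) {
        if (files.containsKey(path)) {
            return false;
        }
        files.put(path, new FileStatus(path, 0L, nowMillis));
        return true;
    }

    // Record the new length after an append to the backing log.
    synchronized void recordAppend(String path, long newLength, long nowMillis) {
        if (!files.containsKey(path)) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        files.put(path, new FileStatus(path, newLength, nowMillis));
    }

    synchronized Optional<FileStatus> getFileStatus(String path) {
        return Optional.ofNullable(files.get(path));
    }

    synchronized boolean delete(String path) {
        return files.remove(path) != null;
    }

    // List files directly under 'dir' (no recursion), in sorted order.
    synchronized List<String> listFiles(String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<String> result = new ArrayList<>();
        for (String path : files.tailMap(prefix, true).keySet()) {
            if (!path.startsWith(prefix)) {
                break; // past the last key sharing this prefix
            }
            if (path.indexOf('/', prefix.length()) < 0) {
                result.add(path); // skip entries in subdirectories
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FileNamespaceSketch ns = new FileNamespaceSketch();
        ns.create("/logs/a", System.currentTimeMillis());
        ns.create("/logs/sub/c", System.currentTimeMillis());
        ns.recordAppend("/logs/a", 42L, System.currentTimeMillis());
        System.out.println(ns.listFiles("/logs"));
    }
}
```

Even this toy version hints at the corner cases: create and delete races,
keeping the recorded length in step with the log, and how directories are
represented.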

Re "better performance" - for data I/O, it should be just fine for
workloads like writes, tailing reads and catch-up reads (scans). I am not
sure about random reads, as we didn't really pay attention to this at
Twitter (although Salesforce has used BookKeeper as storage that also
serves random reads, so it should probably work well). I am not certain
about metadata operations - we did create/open/delete log streams
frequently for some of our use cases, but that might still be less frequent
than in a filesystem. We have a plan to make the stream primitive very
lightweight, so that we can support a huge number of streams. We can
probably work together on improving the metadata part.

I took a look at your pull request. I liked your layout - putting it in a
contrib module to incubate this idea. We definitely welcome any
contributions that make DL easy to use. Feel free to start a proposal
discussion
<https://cwiki.apache.org/confluence/display/DL/Project+Proposals>. I
believe there will be a lot of corner cases to discuss.



>
>> - Gerrit

Re: FileSystem API over distributedlog logs

Posted by Gerrit Sundaram <ge...@gmail.com>.
FYI - I tried to use the AppendOnlyStreamWriter and AppendOnlyStreamReader
to demonstrate the idea:
https://github.com/apache/incubator-distributedlog/pulls/43
Let me know if this is a good direction to pursue.

- Gerrit

On Wed, Nov 2, 2016 at 2:21 AM, Gerrit Sundaram <ge...@gmail.com>
wrote:

> Hi distributedlog folks,
>
> I am new to this community. I am wondering is there anyone tried to build
> a file system over replicated logs. There are a lot of similarities between
> a filesystem file and a replicated log. You can use files to build
> replicated log or use replicated logs to build a filesystem.
>
> I took at the code repo and found there are two files
> 'AppendOnlyStreamReader' and 'AppendOnlyStreamWriter'. They seem to
> implement file I/O related API. Did you guys attempt to provide filesystem
> API over distributedlog?
>
> I am wondering if it is possible to build a filesystem over
> distributedlog. Would this be an interesting topic to this project and the
> community? I have two reasons for that
> - I can leverage the good stuffs like parallel replication, low latency
> for better performance?
> - DL uses zookeeper for metadata storage. ZooKeeper has pretty nice
> filesystem-like interface. So it would be a nice fit.
>
> - Gerrit
