Posted to users@nifi.apache.org by Ali Nazemian <al...@gmail.com> on 2016/10/13 06:50:05 UTC

Nifi hardware recommendation

Dear NiFi users/developers,
Hi,

I was wondering whether there is any benchmark addressing the question of
whether it is better to give NiFi direct control of the disks or to use RAID
for this purpose. For example, which of these scenarios is recommended from a
performance point of view?
Scenario 1:
24 disks in total
2 disks - RAID 1 for OS and flowfile repo
2 disks - RAID 1 for provenance repo1
2 disks - RAID 1 for provenance repo2
2 disks - RAID 1 for content repo1
2 disks - RAID 1 for content repo2
2 disks - RAID 1 for content repo3
2 disks - RAID 1 for content repo4
2 disks - RAID 1 for content repo5
2 disks - RAID 1 for content repo6
2 disks - RAID 1 for content repo7
2 disks - RAID 1 for content repo8
2 disks - RAID 1 for content repo9


Scenario 2:
24 disks in total
2 disks - RAID 1 for OS and flowfile repo
4 disks - RAID 10 for provenance repo1
18 disks - RAID 10 for content repo1
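
(For reference, either layout would be expressed in conf/nifi.properties by
pointing each repository at its own mount point. A minimal sketch for scenario
1 follows; the repository property prefixes are the ones described in the NiFi
administration guide, while the mount-point paths are only placeholders:

    # flowfile repository on the OS/flowfile mirror
    nifi.flowfile.repository.directory=/mnt/flowfile/flowfile_repository
    # two provenance repositories, one per RAID-1 pair
    nifi.provenance.repository.directory.provenance1=/mnt/prov1/provenance_repository
    nifi.provenance.repository.directory.provenance2=/mnt/prov2/provenance_repository
    # nine content repositories, one per RAID-1 pair
    nifi.content.repository.directory.content1=/mnt/cont1/content_repository
    nifi.content.repository.directory.content2=/mnt/cont2/content_repository
    # ...and so on through content9
    nifi.content.repository.directory.content9=/mnt/cont9/content_repository

Scenario 2 would simply use a single directory property per repository.)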

Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
Thank you very much.

Best regards,
Ali

Re: Nifi hardware recommendation

Posted by Joe Witt <jo...@gmail.com>.
I'd also add to Mark's great reply that, beyond the heap, disk caching, and
avoiding swapping, another good use of RAM is off-heap native storage of
things like reference datasets that can be wired into NiFi flows for
high-speed enrichment, where you can even hot-swap older and newer versions
of those reference sets.
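
To make that concrete, here is a generic sketch (not a NiFi API) of holding a
reference dataset off-heap in plain Java by memory-mapping a file; the class
name and file path are hypothetical:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.atomic.AtomicReference;

    public class OffHeapReferenceData {

        // Current version of the mapped dataset; swapping in a newly mapped
        // buffer gives the "hot swap" behaviour mentioned above.
        private static final AtomicReference<MappedByteBuffer> CURRENT =
                new AtomicReference<>();

        // Maps the file into off-heap memory: the bytes live in the OS-managed
        // mapped region / page cache, not on the JVM heap.
        static MappedByteBuffer map(Path file) throws IOException {
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            }
        }

        public static void main(String[] args) throws IOException {
            CURRENT.set(map(Paths.get("/data/reference/current.bin"))); // hypothetical path

            // Read a few bytes from the mapped (off-heap) region as a demonstration.
            ByteBuffer view = CURRENT.get().duplicate();
            byte[] head = new byte[Math.min(16, view.remaining())];
            view.get(head);
            System.out.println(new String(head, StandardCharsets.UTF_8));
        }
    }

A custom processor or controller service could keep such a buffer around and
atomically replace it when a newer version of the reference set arrives.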

On Fri, Oct 14, 2016 at 8:41 AM, Mark Payne <ma...@hotmail.com> wrote:

> Hi Ali,
>
> Typically, we see people using a 4-8 GB heap with NiFi. 8 GB is pretty
> typical for a flow that is expected to have
> pretty high throughput in terms of the number of FlowFiles, or a large
> number of processors. However, one thing
> that you will want to consider in terms of RAM is disk caching. While NiFi
> may not use a huge amount of RAM
> directly, the operating system's disk cache is immensely valuable. Because
> the content of FlowFiles is written
> to disk, having a small amount of RAM can result in the next processor
> needing to read that content from disk.
> However, with a sufficient amount of RAM, you will see by looking at
> operating system metrics such as (iostat -xmh 5, for linux)
> that NiFi almost never reads FlowFile content from disk. Instead, it is
> able to get all it needs from the disk cache.
> Frequently querying provenance data also shows a huge difference in
> performance if you have enough RAM.
>
> So the ideal case, I would say, is to have enough RAM for NiFi's heap, as
> well as the content size of all FlowFiles
> that will be actively in your flow at once, plus all other things that
> need to go on, on that box. That said, NiFi should
> certainly work fine reading the content from disk if it needs to - just
> with lower performance.
>
> Does this answer your question?
>
> Thanks
> -Mark
>
>
> On Oct 13, 2016, at 7:47 PM, Ali Nazemian <al...@gmail.com> wrote:
>
> Hi,
>
> I have another question regarding the hardware recommendation. As far as I
> found out, Nifi uses on-heap memory currently, and it will not try to load
> the whole object in memory. From the garbage collection perspective, it is
> not recommended to dedicate more than 8-10 GB to JVM heap space. In this
> case, may I say spending money on system memory is useless? Probably 16 GB
> per each system is enough according to this architecture. Unless some
> architecture changes appear in the future to use off-heap memory as well.
> However, I found some articles about best practices, and in terms of memory
> recommendation it does not make sense. Would you please clarify this part
> for me?
> Thank you very much.
>
> Best regards,
> Ali
>
>
> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <al...@gmail.com>
> wrote:
>
>> Thank you very much.
>> I would be more than happy to provide some benchmark results after the
>> implementation.
>> Sincerely yours,
>> Ali
>>
>> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <jo...@gmail.com> wrote:
>>
>>> Ali,
>>>
>>> I agree with your assumption.  It would be great to test that out and
>>> provide some numbers but intuitively I agree.
>>>
>>> I could envision certain scatter/gather data flows that could challenge
>>> that sequential access assumption but honestly with how awesome disk
>>> caching is in Linux these days I think practically speaking this is the
>>> right way to think about it.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <al...@gmail.com>
>>> wrote:
>>>
>>>> Dear Joe,
>>>>
>>>> Thank you very much. That was a really great explanation.
>>>> I investigated the Nifi architecture, and it seems that most of the
>>>> read/write operations for flow file repo and provenance repo are random.
>>>> However, for content repo most of the read/write operations are sequential.
>>>> Let's say cost does not matter. In this case, even choosing SSD for content
>>>> repo can not provide huge performance gain instead of HDD. Am I right?
>>>> Hence, it would be better to spend content repo SSD money on network
>>>> infrastructure.
>>>>
>>>> Best regards,
>>>> Ali
>>>>
>>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <jo...@gmail.com> wrote:
>>>>
>>>>> Ali,
>>>>>
>>>>> You have a lot of nice resources to work with there.  I'd recommend
>>>>> the series of RAID-1 configuration personally provided you keep in mind
>>>>> this means you can only lose a single disk for any one partition.  As long
>>>>> as they're being monitored and would be quickly replaced this in practice
>>>>> works well.  If there could be lapses in monitoring or time to replace then
>>>>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>>>>
>>>>> I'd say do the OS, app installs w/user and audit db stuff, application
>>>>> logs on one physical RAID volume.  Have a dedicated physical volume for the
>>>>> flow file repository.  It will not be able to use all the space but it
>>>>> certainly could benefit from having no other contention.  This could be a
>>>>> great thing to have SSDs for actually.  And for the remaining volumes split
>>>>> them up for content and provenance as you have.  You get to make the
>>>>> overall performance versus retention decision.  Frankly, you have a great
>>>>> system to work with and I suspect you're going to see excellent results
>>>>> anyway.
>>>>>
>>>>> Conservatively speaking expect say 50MB/s of throughput per volume in
>>>>> the content repository so if you end up with 8 of them could achieve
>>>>> upwards of 400MB/s sustained.  You'll also then want to make sure you have
>>>>> a good 10G based network setup as well.  Or, you could dial back on the
>>>>> speed tradeoff and simply increase retention or disk loss tolerance.  Lots
>>>>> of ways to play the game.
>>>>>
>>>>> There are no published SSD vs HDD performance benchmarks that I am
>>>>> aware of though this is a good idea.  Having a hybrid of SSDs and HDDs
>>>>> could offer a really solid performance/retention/cost tradeoff.  For
>>>>> example having SSDs for the OS/logs/provenance/flowfile with HDDs for the
>>>>> content - that would be quite nice.  At that rate to take full advantage of
>>>>> the system you'd need to have very strong network infrastructure between
>>>>> NiFi and any systems it is interfacing with  and your flows would need to
>>>>> be well tuned for GC/memory efficiency.
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <al...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear Nifi Users/ developers,
>>>>>> Hi,
>>>>>>
>>>>>> I was wondering is there any benchmark about the question that is it
>>>>>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>>>>>> example, which of these scenarios is recommended from the performance point
>>>>>> of view?
>>>>>> Scenario 1:
>>>>>> 24 disk in total
>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>> 2 disk- raid 1 for provenance repo1
>>>>>> 2 disk- raid 1 for provenance repo2
>>>>>> 2 disk- raid 1 for content repo1
>>>>>> 2 disk- raid 1 for content repo2
>>>>>> 2 disk- raid 1 for content repo3
>>>>>> 2 disk- raid 1 for content repo4
>>>>>> 2 disk- raid 1 for content repo5
>>>>>> 2 disk- raid 1 for content repo6
>>>>>> 2 disk- raid 1 for content repo7
>>>>>> 2 disk- raid 1 for content repo8
>>>>>> 2 disk- raid 1 for content repo9
>>>>>>
>>>>>>
>>>>>> Scenario 2:
>>>>>> 24 disk in total
>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>> 4 disk- raid 10 for provenance repo1
>>>>>> 18 disk- raid 10 for content repo1
>>>>>>
>>>>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>>>>> Thank you very much.
>>>>>>
>>>>>> Best regards,
>>>>>> Ali
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> A.Nazemian
>>>>
>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>
>
>
> --
> A.Nazemian
>
>
>

Re: Nifi hardware recommendation

Posted by Mark Payne <ma...@hotmail.com>.
Hi Ali,

Typically, we see people using a 4-8 GB heap with NiFi. 8 GB is pretty typical for a flow that is expected to have
pretty high throughput in terms of the number of FlowFiles, or a large number of processors. However, one thing
that you will want to consider in terms of RAM is disk caching. While NiFi may not use a huge amount of RAM
directly, the operating system's disk cache is immensely valuable. Because the content of FlowFiles is written
to disk, having a small amount of RAM can result in the next processor needing to read that content from disk.
However, with a sufficient amount of RAM, you will see, by looking at operating system metrics (such as iostat -xmh 5 on Linux),
that NiFi almost never reads FlowFile content from disk. Instead, it is able to get all it needs from the disk cache.
Frequently querying provenance data also shows a huge difference in performance if you have enough RAM.
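
As a quick way to check this on a running node, something like the following
can be used on Linux (sysstat's iostat; exact column names vary a little
between versions):

    # refresh extended, per-device statistics every 5 seconds
    iostat -xmh 5
    # if the devices backing the content repositories show near-zero reads
    # (r/s, rMB/s) while NiFi is busy, content is being served from the
    # OS page cache rather than from disk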

So the ideal case, I would say, is to have enough RAM for NiFi's heap, as well as the content size of all FlowFiles
that will be actively in your flow at once, plus all other things that need to go on, on that box. That said, NiFi should
certainly work fine reading the content from disk if it needs to - just with lower performance.
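
For reference, the heap sizes mentioned above are set in conf/bootstrap.conf;
a minimal sketch with illustrative values:

    # conf/bootstrap.conf
    java.arg.2=-Xms8g
    java.arg.3=-Xmx8g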

Does this answer your question?

Thanks
-Mark


> On Oct 13, 2016, at 7:47 PM, Ali Nazemian <al...@gmail.com> wrote:
> 
> Hi,
> 
> I have another question regarding the hardware recommendation. As far as I found out, Nifi uses on-heap memory currently, and it will not try to load the whole object in memory. From the garbage collection perspective, it is not recommended to dedicate more than 8-10 GB to JVM heap space. In this case, may I say spending money on system memory is useless? Probably 16 GB per each system is enough according to this architecture. Unless some architecture changes appear in the future to use off-heap memory as well. However, I found some articles about best practices, and in terms of memory recommendation it does not make sense. Would you please clarify this part for me?
> Thank you very much.
> 
> Best regards,
> Ali
> 
> 
> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <alinazemian@gmail.com <ma...@gmail.com>> wrote:
> Thank you very much. 
> I would be more than happy to provide some benchmark results after the implementation. 
> 
> Sincerely yours,
> Ali
> 
> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <joe.witt@gmail.com <ma...@gmail.com>> wrote:
> Ali,
> 
> I agree with your assumption.  It would be great to test that out and provide some numbers but intuitively I agree.
> 
> I could envision certain scatter/gather data flows that could challenge that sequential access assumption but honestly with how awesome disk caching is in Linux these days I think practically speaking this is the right way to think about it.
> 
> Thanks
> Joe
> 
> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <alinazemian@gmail.com <ma...@gmail.com>> wrote:
> Dear Joe,
> 
> Thank you very much. That was a really great explanation. 
> I investigated the Nifi architecture, and it seems that most of the read/write operations for flow file repo and provenance repo are random. However, for content repo most of the read/write operations are sequential. Let's say cost does not matter. In this case, even choosing SSD for content repo can not provide huge performance gain instead of HDD. Am I right? Hence, it would be better to spend content repo SSD money on network infrastructure.
> 
> Best regards,
> Ali
> 
> 
> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <joe.witt@gmail.com <ma...@gmail.com>> wrote:
> Ali,
> 
> You have a lot of nice resources to work with there.  I'd recommend the series of RAID-1 configuration personally provided you keep in mind this means you can only lose a single disk for any one partition.  As long as they're being monitored and would be quickly replaced this in practice works well.  If there could be lapses in monitoring or time to replace then it is perhaps safer to go with more redundancy or an alternative RAID type.
> 
> I'd say do the OS, app installs w/user and audit db stuff, application logs on one physical RAID volume.  Have a dedicated physical volume for the flow file repository.  It will not be able to use all the space but it certainly could benefit from having no other contention.  This could be a great thing to have SSDs for actually.  And for the remaining volumes split them up for content and provenance as you have.  You get to make the overall performance versus retention decision.  Frankly, you have a great system to work with and I suspect you're going to see excellent results anyway.
> 
> Conservatively speaking expect say 50MB/s of throughput per volume in the content repository so if you end up with 8 of them could achieve upwards of 400MB/s sustained.  You'll also then want to make sure you have a good 10G based network setup as well.  Or, you could dial back on the speed tradeoff and simply increase retention or disk loss tolerance.  Lots of ways to play the game.
> 
> There are no published SSD vs HDD performance benchmarks that I am aware of though this is a good idea.  Having a hybrid of SSDs and HDDs could offer a really solid performance/retention/cost tradeoff.  For example having SSDs for the OS/logs/provenance/flowfile with HDDs for the content - that would be quite nice.  At that rate to take full advantage of the system you'd need to have very strong network infrastructure between NiFi and any systems it is interfacing with  and your flows would need to be well tuned for GC/memory efficiency.
> 
> Thanks
> Joe 
> 
> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <alinazemian@gmail.com <ma...@gmail.com>> wrote:
> Dear Nifi Users/ developers,
> Hi,
> 
> I was wondering is there any benchmark about the question that is it better to dedicate disk control to Nifi or using RAID for this purpose? For example, which of these scenarios is recommended from the performance point of view? 
> Scenario 1: 
> 24 disk in total
> 2 disk- raid 1 for OS and fileflow repo
> 2 disk- raid 1 for provenance repo1
> 2 disk- raid 1 for provenance repo2
> 2 disk- raid 1 for content repo1
> 2 disk- raid 1 for content repo2
> 2 disk- raid 1 for content repo3
> 2 disk- raid 1 for content repo4
> 2 disk- raid 1 for content repo5
> 2 disk- raid 1 for content repo6
> 2 disk- raid 1 for content repo7
> 2 disk- raid 1 for content repo8
> 2 disk- raid 1 for content repo9
> 
> 
> Scenario 2: 
> 24 disk in total
> 2 disk- raid 1 for OS and fileflow repo
> 4 disk- raid 10 for provenance repo1
> 18 disk- raid 10 for content repo1
> 
> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
> Thank you very much.
> 
> Best regards,
> Ali
> 
> 
> 
> 
> 
> -- 
> A.Nazemian
> 
> 
> 
> 
> -- 
> A.Nazemian
> 
> 
> 
> -- 
> A.Nazemian


Re: Nifi hardware recommendation

Posted by Ali Nazemian <al...@gmail.com>.
Dear all,

Thank you very much for all of the detailed responses. Regarding the JVM heap
space recommendation, I am aware that it is possible to tune a 64 GB heap for
the JVM, but it may cause long pauses in some cases. Anyway, my question was
not about how much memory the NiFi JVM requires, because tuning the JVM for
that purpose certainly takes a lot of effort. Rather, I was trying to identify
the different memory use cases in the NiFi architecture. To summarize the
responses, memory in NiFi is needed for the JVM heap, for the OS disk cache,
and for off-heap usage such as reference datasets.
Thanks again for all of the responses.
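
As a purely illustrative back-of-envelope sizing for a single node under those
three headings (every number here is an assumption, not a recommendation):

    8 GB   JVM heap
    2 GB   off-heap / native use (processors, reference datasets)
    38 GB  left to the OS page cache for content and provenance reads
    -----
    48 GB  total RAM, i.e. noticeably more than 16 GB per system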

Best regards,
Ali

On Sat, Oct 15, 2016 at 10:43 AM, Joe Witt <jo...@gmail.com> wrote:

> The validity of that advice depends on a lot of factors.  G1 changed the
> game a bit for pause times for sure but you can still see larger pause
> times than acceptable for some cases.  In any event I agree that we should
> be more careful with how we describe heap usage.
>
> Thanks
> Joe
>
> On Oct 14, 2016 7:10 PM, "Russell Bateman" <russell.bateman@
> perfectsearchcorp.com> wrote:
>
> Yeah, I spent a bit of time this morning before posting looking for a
> magic 8-10Gb advisory and generally for GC gotchas related to larger heap
> sizes in the 64-bit world, but couldn't find any. We're using 12Gb right
> now for NiFi and haven't noticed any trouble. We vaguely conceive of
> increasing this amount in the future as needed as our servers tend to run
> large amounts of memory.
>
> The statement yesterday on this thread warning against using that much is
> what sent me into Google-it mode. I think this advice is a red herring.
>
> Russ
>
> On 10/14/2016 03:03 PM, Corey Flowers wrote:
>
> We actually use heap sizes from 32 to 64Gb for ours but our volumes and
> graphs are both extremely large. Although I believe the smaller heap sizes
> were a limitation of the garbage collection in Java 7. We also moved to ssd
> drives, which did help throughput quite a bit. Our systems were actually
> requesting the creation and removal of file handles faster than traditional
> disks could keep up with (we believe). In addition, unlike with traditional
> drives where we tried to minimize caching, we actually forced more disk
> caching when we moved to ssds. Still waiting to see the results of that on
> our volumes, although it does seemed to have help. Also remember, depending
> on how you code them, individual processors can use system memory outside
> of the heap. So you need to take that into consideration when designing the
> servers.
>
> Sent from my iPhone
>
> On Oct 14, 2016, at 1:36 PM, Joe Witt < <jo...@gmail.com>
> joe.witt@gmail.com> wrote:
>
> Russ,
>
> You can definitely find a lot of material on the Internet about Java heap
> sizes, types of garbage collectors, application usage patterns.  By all
> means please do experiment with different sizes appropriate for your case.
> We're not saying NiFi itself has any problem with large heaps.
>
> Thanks
> Joe
>
> On Fri, Oct 14, 2016 at 12:44 PM, Russell Bateman <
> <ru...@perfectsearchcorp.com>russell.bateman@perfectsearch
> corp.com> wrote:
>
>> Ali,
>>
>> "not recommended to dedicate more than 8-10 GM to JVM heap space" by
>> whom? Do you have links/references establishing this? I couldn't find
>> anyone saying this or why.
>>
>> Russ
>>
>> On 10/13/2016 05:47 PM, Ali Nazemian wrote:
>>
>> Hi,
>>
>> I have another question regarding the hardware recommendation. As far as
>> I found out, Nifi uses on-heap memory currently, and it will not try to
>> load the whole object in memory. From the garbage collection perspective,
>> it is not recommended to dedicate more than 8-10 GB to JVM heap space. In
>> this case, may I say spending money on system memory is useless? Probably
>> 16 GB per each system is enough according to this architecture. Unless some
>> architecture changes appear in the future to use off-heap memory as well.
>> However, I found some articles about best practices, and in terms of memory
>> recommendation it does not make sense. Would you please clarify this part
>> for me?
>> Thank you very much.
>>
>> Best regards,
>> Ali
>>
>>
>> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian < <al...@gmail.com>
>> alinazemian@gmail.com> wrote:
>>
>>> Thank you very much.
>>> I would be more than happy to provide some benchmark results after the
>>> implementation.
>>> Sincerely yours,
>>> Ali
>>>
>>> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt < <jo...@gmail.com>
>>> joe.witt@gmail.com> wrote:
>>>
>>>> Ali,
>>>>
>>>> I agree with your assumption.  It would be great to test that out and
>>>> provide some numbers but intuitively I agree.
>>>>
>>>> I could envision certain scatter/gather data flows that could challenge
>>>> that sequential access assumption but honestly with how awesome disk
>>>> caching is in Linux these days I think practically speaking this is the
>>>> right way to think about it.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian < <al...@gmail.com>
>>>> alinazemian@gmail.com> wrote:
>>>>
>>>>> Dear Joe,
>>>>>
>>>>> Thank you very much. That was a really great explanation.
>>>>> I investigated the Nifi architecture, and it seems that most of the
>>>>> read/write operations for flow file repo and provenance repo are random.
>>>>> However, for content repo most of the read/write operations are sequential.
>>>>> Let's say cost does not matter. In this case, even choosing SSD for content
>>>>> repo can not provide huge performance gain instead of HDD. Am I right?
>>>>> Hence, it would be better to spend content repo SSD money on network
>>>>> infrastructure.
>>>>>
>>>>> Best regards,
>>>>> Ali
>>>>>
>>>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt < <jo...@gmail.com>
>>>>> joe.witt@gmail.com> wrote:
>>>>>
>>>>>> Ali,
>>>>>>
>>>>>> You have a lot of nice resources to work with there.  I'd recommend
>>>>>> the series of RAID-1 configuration personally provided you keep in mind
>>>>>> this means you can only lose a single disk for any one partition.  As long
>>>>>> as they're being monitored and would be quickly replaced this in practice
>>>>>> works well.  If there could be lapses in monitoring or time to replace then
>>>>>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>>>>>
>>>>>> I'd say do the OS, app installs w/user and audit db stuff,
>>>>>> application logs on one physical RAID volume.  Have a dedicated physical
>>>>>> volume for the flow file repository.  It will not be able to use all the
>>>>>> space but it certainly could benefit from having no other contention.  This
>>>>>> could be a great thing to have SSDs for actually.  And for the remaining
>>>>>> volumes split them up for content and provenance as you have.  You get to
>>>>>> make the overall performance versus retention decision.  Frankly, you have
>>>>>> a great system to work with and I suspect you're going to see excellent
>>>>>> results anyway.
>>>>>>
>>>>>> Conservatively speaking expect say 50MB/s of throughput per volume in
>>>>>> the content repository so if you end up with 8 of them could achieve
>>>>>> upwards of 400MB/s sustained.  You'll also then want to make sure you have
>>>>>> a good 10G based network setup as well.  Or, you could dial back on the
>>>>>> speed tradeoff and simply increase retention or disk loss tolerance.  Lots
>>>>>> of ways to play the game.
>>>>>>
>>>>>> There are no published SSD vs HDD performance benchmarks that I am
>>>>>> aware of though this is a good idea.  Having a hybrid of SSDs and HDDs
>>>>>> could offer a really solid performance/retention/cost tradeoff.  For
>>>>>> example having SSDs for the OS/logs/provenance/flowfile with HDDs for the
>>>>>> content - that would be quite nice.  At that rate to take full advantage of
>>>>>> the system you'd need to have very strong network infrastructure between
>>>>>> NiFi and any systems it is interfacing with  and your flows would need to
>>>>>> be well tuned for GC/memory efficiency.
>>>>>>
>>>>>> Thanks
>>>>>> Joe
>>>>>>
>>>>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <
>>>>>> <al...@gmail.com> wrote:
>>>>>>
>>>>>>> Dear Nifi Users/ developers,
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was wondering is there any benchmark about the question that is it
>>>>>>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>>>>>>> example, which of these scenarios is recommended from the performance point
>>>>>>> of view?
>>>>>>> Scenario 1:
>>>>>>> 24 disk in total
>>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>>> 2 disk- raid 1 for provenance repo1
>>>>>>> 2 disk- raid 1 for provenance repo2
>>>>>>> 2 disk- raid 1 for content repo1
>>>>>>> 2 disk- raid 1 for content repo2
>>>>>>> 2 disk- raid 1 for content repo3
>>>>>>> 2 disk- raid 1 for content repo4
>>>>>>> 2 disk- raid 1 for content repo5
>>>>>>> 2 disk- raid 1 for content repo6
>>>>>>> 2 disk- raid 1 for content repo7
>>>>>>> 2 disk- raid 1 for content repo8
>>>>>>> 2 disk- raid 1 for content repo9
>>>>>>>
>>>>>>>
>>>>>>> Scenario 2:
>>>>>>> 24 disk in total
>>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>>> 4 disk- raid 10 for provenance repo1
>>>>>>> 18 disk- raid 10 for content repo1
>>>>>>>
>>>>>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Ali
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> A.Nazemian
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> A.Nazemian
>>>
>>
>>
>>
>> --
>> A.Nazemian
>>
>>
>>
>
>
>


-- 
A.Nazemian

Re: Nifi hardware recommendation

Posted by Joe Witt <jo...@gmail.com>.
The validity of that advice depends on a lot of factors.  G1 changed the
game a bit for pause times for sure but you can still see larger pause
times than acceptable for some cases.  In any event I agree that we should
be more careful with how we describe heap usage.
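
If you do want to try G1 on an older release, it is enabled through a JVM
argument in conf/bootstrap.conf; a minimal sketch (the argument index is
arbitrary, and your bootstrap.conf may already carry a similar, possibly
commented-out, line):

    # conf/bootstrap.conf
    java.arg.13=-XX:+UseG1GC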

Thanks
Joe

On Oct 14, 2016 7:10 PM, "Russell Bateman" <
russell.bateman@perfectsearchcorp.com> wrote:

Yeah, I spent a bit of time this morning before posting looking for a magic
8-10Gb advisory and generally for GC gotchas related to larger heap sizes
in the 64-bit world, but couldn't find any. We're using 12Gb right now for
NiFi and haven't noticed any trouble. We vaguely conceive of increasing
this amount in the future as needed as our servers tend to run large
amounts of memory.

The statement yesterday on this thread warning against using that much is
what sent me into Google-it mode. I think this advice is a red herring.

Russ

On 10/14/2016 03:03 PM, Corey Flowers wrote:

We actually use heap sizes from 32 to 64Gb for ours but our volumes and
graphs are both extremely large. Although I believe the smaller heap sizes
were a limitation of the garbage collection in Java 7. We also moved to ssd
drives, which did help throughput quite a bit. Our systems were actually
requesting the creation and removal of file handles faster than traditional
disks could keep up with (we believe). In addition, unlike with traditional
drives where we tried to minimize caching, we actually forced more disk
caching when we moved to ssds. Still waiting to see the results of that on
our volumes, although it does seem to have helped. Also remember, depending
on how you code them, individual processors can use system memory outside
of the heap. So you need to take that into consideration when designing the
servers.

Sent from my iPhone

On Oct 14, 2016, at 1:36 PM, Joe Witt <jo...@gmail.com> wrote:

Russ,

You can definitely find a lot of material on the Internet about Java heap
sizes, types of garbage collectors, application usage patterns.  By all
means please do experiment with different sizes appropriate for your case.
We're not saying NiFi itself has any problem with large heaps.

Thanks
Joe

On Fri, Oct 14, 2016 at 12:44 PM, Russell Bateman <russell.bateman@
perfectsearchcorp.com> wrote:

> Ali,
>
> "not recommended to dedicate more than 8-10 GM to JVM heap space" by whom?
> Do you have links/references establishing this? I couldn't find anyone
> saying this or why.
>
> Russ
>
> On 10/13/2016 05:47 PM, Ali Nazemian wrote:
>
> Hi,
>
> I have another question regarding the hardware recommendation. As far as I
> found out, Nifi uses on-heap memory currently, and it will not try to load
> the whole object in memory. From the garbage collection perspective, it is
> not recommended to dedicate more than 8-10 GB to JVM heap space. In this
> case, may I say spending money on system memory is useless? Probably 16 GB
> per each system is enough according to this architecture. Unless some
> architecture changes appear in the future to use off-heap memory as well.
> However, I found some articles about best practices, and in terms of memory
> recommendation it does not make sense. Would you please clarify this part
> for me?
> Thank you very much.
>
> Best regards,
> Ali
>
>
> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <al...@gmail.com>
> wrote:
>
>> Thank you very much.
>> I would be more than happy to provide some benchmark results after the
>> implementation.
>> Sincerely yours,
>> Ali
>>
>> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <jo...@gmail.com> wrote:
>>
>>> Ali,
>>>
>>> I agree with your assumption.  It would be great to test that out and
>>> provide some numbers but intuitively I agree.
>>>
>>> I could envision certain scatter/gather data flows that could challenge
>>> that sequential access assumption but honestly with how awesome disk
>>> caching is in Linux these days I think practically speaking this is the
>>> right way to think about it.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <al...@gmail.com>
>>> wrote:
>>>
>>>> Dear Joe,
>>>>
>>>> Thank you very much. That was a really great explanation.
>>>> I investigated the Nifi architecture, and it seems that most of the
>>>> read/write operations for flow file repo and provenance repo are random.
>>>> However, for content repo most of the read/write operations are sequential.
>>>> Let's say cost does not matter. In this case, even choosing SSD for content
>>>> repo can not provide huge performance gain instead of HDD. Am I right?
>>>> Hence, it would be better to spend content repo SSD money on network
>>>> infrastructure.
>>>>
>>>> Best regards,
>>>> Ali
>>>>
>>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <jo...@gmail.com> wrote:
>>>>
>>>>> Ali,
>>>>>
>>>>> You have a lot of nice resources to work with there.  I'd recommend
>>>>> the series of RAID-1 configuration personally provided you keep in mind
>>>>> this means you can only lose a single disk for any one partition.  As long
>>>>> as they're being monitored and would be quickly replaced this in practice
>>>>> works well.  If there could be lapses in monitoring or time to replace then
>>>>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>>>>
>>>>> I'd say do the OS, app installs w/user and audit db stuff, application
>>>>> logs on one physical RAID volume.  Have a dedicated physical volume for the
>>>>> flow file repository.  It will not be able to use all the space but it
>>>>> certainly could benefit from having no other contention.  This could be a
>>>>> great thing to have SSDs for actually.  And for the remaining volumes split
>>>>> them up for content and provenance as you have.  You get to make the
>>>>> overall performance versus retention decision.  Frankly, you have a great
>>>>> system to work with and I suspect you're going to see excellent results
>>>>> anyway.
>>>>>
>>>>> Conservatively speaking expect say 50MB/s of throughput per volume in
>>>>> the content repository so if you end up with 8 of them could achieve
>>>>> upwards of 400MB/s sustained.  You'll also then want to make sure you have
>>>>> a good 10G based network setup as well.  Or, you could dial back on the
>>>>> speed tradeoff and simply increase retention or disk loss tolerance.  Lots
>>>>> of ways to play the game.
>>>>>
>>>>> There are no published SSD vs HDD performance benchmarks that I am
>>>>> aware of though this is a good idea.  Having a hybrid of SSDs and HDDs
>>>>> could offer a really solid performance/retention/cost tradeoff.  For
>>>>> example having SSDs for the OS/logs/provenance/flowfile with HDDs for the
>>>>> content - that would be quite nice.  At that rate to take full advantage of
>>>>> the system you'd need to have very strong network infrastructure between
>>>>> NiFi and any systems it is interfacing with  and your flows would need to
>>>>> be well tuned for GC/memory efficiency.
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <al...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear Nifi Users/ developers,
>>>>>> Hi,
>>>>>>
>>>>>> I was wondering is there any benchmark about the question that is it
>>>>>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>>>>>> example, which of these scenarios is recommended from the performance point
>>>>>> of view?
>>>>>> Scenario 1:
>>>>>> 24 disk in total
>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>> 2 disk- raid 1 for provenance repo1
>>>>>> 2 disk- raid 1 for provenance repo2
>>>>>> 2 disk- raid 1 for content repo1
>>>>>> 2 disk- raid 1 for content repo2
>>>>>> 2 disk- raid 1 for content repo3
>>>>>> 2 disk- raid 1 for content repo4
>>>>>> 2 disk- raid 1 for content repo5
>>>>>> 2 disk- raid 1 for content repo6
>>>>>> 2 disk- raid 1 for content repo7
>>>>>> 2 disk- raid 1 for content repo8
>>>>>> 2 disk- raid 1 for content repo9
>>>>>>
>>>>>>
>>>>>> Scenario 2:
>>>>>> 24 disk in total
>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>> 4 disk- raid 10 for provenance repo1
>>>>>> 18 disk- raid 10 for content repo1
>>>>>>
>>>>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>>>>> Thank you very much.
>>>>>>
>>>>>> Best regards,
>>>>>> Ali
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> A.Nazemian
>>>>
>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>
>
>
> --
> A.Nazemian
>
>
>

Re: Nifi hardware recommendation

Posted by Russell Bateman <ru...@perfectsearchcorp.com>.
Yeah, I spent a bit of time this morning before posting looking for a 
magic 8-10Gb advisory and generally for GC gotchas related to larger 
heap sizes in the 64-bit world, but couldn't find any. We're using 12Gb 
right now for NiFi and haven't noticed any trouble. We vaguely conceive 
of increasing this amount in the future as needed as our servers tend to 
run large amounts of memory.

The statement yesterday on this thread warning against using that much 
is what sent me into Google-it mode. I think this advice is a red herring.

Russ

On 10/14/2016 03:03 PM, Corey Flowers wrote:
> We actually use heap sizes from 32 to 64Gb for ours but our volumes 
> and graphs are both extremely large. Although I believe the smaller 
> heap sizes were a limitation of the garbage collection in Java 7. We 
> also moved to ssd drives, which did help throughput quite a bit. Our
> systems were actually requesting the creation and removal of file 
> handles faster than traditional disks could keep up with (we believe). 
> In addition, unlike with traditional drives where we tried to minimize
> caching, we actually forced more disk caching when we moved to ssds. 
> Still waiting to see the results of that on our volumes, although it 
> does seem to have helped. Also remember, depending on how you code
> them, individual processors can use system memory outside of the heap. 
> So you need to take that into consideration when designing the servers.
>
> Sent from my iPhone
>
> On Oct 14, 2016, at 1:36 PM, Joe Witt <joe.witt@gmail.com 
> <ma...@gmail.com>> wrote:
>
>> Russ,
>>
>> You can definitely find a lot of material on the Internet about Java 
>> heap sizes, types of garbage collectors, application usage patterns.  
>> By all means please do experiment with different sizes appropriate 
>> for your case.  We're not saying NiFi itself has any problem with 
>> large heaps.
>>
>> Thanks
>> Joe
>>
>> On Fri, Oct 14, 2016 at 12:44 PM, Russell Bateman 
>> <russell.bateman@perfectsearchcorp.com 
>> <ma...@perfectsearchcorp.com>> wrote:
>>
>>     Ali,
>>
>>     "not recommended to dedicate more than 8-10 GB to JVM heap space"
>>     by whom? Do you have links/references establishing this? I
>>     couldn't find anyone saying this or why.
>>
>>     Russ
>>
>>     On 10/13/2016 05:47 PM, Ali Nazemian wrote:
>>>     Hi,
>>>
>>>     I have another question regarding the hardware recommendation.
>>>     As far as I found out, Nifi uses on-heap memory currently, and
>>>     it will not try to load the whole object in memory. From the
>>>     garbage collection perspective, it is not recommended to
>>>     dedicate more than 8-10 GB to JVM heap space. In this case, may
>>>     I say spending money on system memory is useless? Probably 16 GB
>>>     per each system is enough according to this architecture. Unless
>>>     some architecture changes appear in the future to use off-heap
>>>     memory as well. However, I found some articles about best
>>>     practices, and in terms of memory recommendation it does not
>>>     make sense. Would you please clarify this part for me?
>>>     Thank you very much.
>>>
>>>     Best regards,
>>>     Ali
>>>
>>>
>>>     On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian
>>>     <alinazemian@gmail.com <ma...@gmail.com>> wrote:
>>>
>>>         Thank you very much.
>>>         I would be more than happy to provide some benchmark results
>>>         after the implementation.
>>>         Sincerely yours,
>>>         Ali
>>>
>>>         On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt
>>>         <joe.witt@gmail.com <ma...@gmail.com>> wrote:
>>>
>>>             Ali,
>>>
>>>             I agree with your assumption.  It would be great to test
>>>             that out and provide some numbers but intuitively I agree.
>>>
>>>             I could envision certain scatter/gather data flows that
>>>             could challenge that sequential access assumption but
>>>             honestly with how awesome disk caching is in Linux these
>>>             days I think practically speaking this is the right way
>>>             to think about it.
>>>
>>>             Thanks
>>>             Joe
>>>
>>>             On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian
>>>             <alinazemian@gmail.com <ma...@gmail.com>>
>>>             wrote:
>>>
>>>                 Dear Joe,
>>>
>>>                 Thank you very much. That was a really great
>>>                 explanation.
>>>                 I investigated the Nifi architecture, and it seems
>>>                 that most of the read/write operations for flow file
>>>                 repo and provenance repo are random. However, for
>>>                 content repo most of the read/write operations are
>>>                 sequential. Let's say cost does not matter. In this
>>>                 case, even choosing SSD for content repo can not
>>>                 provide huge performance gain instead of HDD. Am I
>>>                 right? Hence, it would be better to spend content
>>>                 repo SSD money on network infrastructure.
>>>
>>>                 Best regards,
>>>                 Ali
>>>
>>>                 On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt
>>>                 <joe.witt@gmail.com <ma...@gmail.com>> wrote:
>>>
>>>                     Ali,
>>>
>>>                     You have a lot of nice resources to work with
>>>                     there.  I'd recommend the series of RAID-1
>>>                     configuration personally provided you keep in
>>>                     mind this means you can only lose a single disk
>>>                     for any one partition.  As long as they're being
>>>                     monitored and would be quickly replaced this in
>>>                     practice works well. If there could be lapses in
>>>                     monitoring or time to replace then it is perhaps
>>>                     safer to go with more redundancy or an
>>>                     alternative RAID type.
>>>
>>>                     I'd say do the OS, app installs w/user and audit
>>>                     db stuff, application logs on one physical RAID
>>>                     volume.  Have a dedicated physical volume for
>>>                     the flow file repository. It will not be able to
>>>                     use all the space but it certainly could benefit
>>>                     from having no other contention. This could be a
>>>                     great thing to have SSDs for actually. And for
>>>                     the remaining volumes split them up for content
>>>                     and provenance as you have.  You get to make the
>>>                     overall performance versus retention decision.
>>>                     Frankly, you have a great system to work with
>>>                     and I suspect you're going to see excellent
>>>                     results anyway.
>>>
>>>                     Conservatively speaking expect say 50MB/s of
>>>                     throughput per volume in the content repository
>>>                     so if you end up with 8 of them could achieve
>>>                     upwards of 400MB/s sustained. You'll also then
>>>                     want to make sure you have a good 10G based
>>>                     network setup as well.  Or, you could dial back
>>>                     on the speed tradeoff and simply increase
>>>                     retention or disk loss tolerance. Lots of ways
>>>                     to play the game.
>>>
>>>                     There are no published SSD vs HDD performance
>>>                     benchmarks that I am aware of though this is a
>>>                     good idea. Having a hybrid of SSDs and HDDs
>>>                     could offer a really solid
>>>                     performance/retention/cost tradeoff.  For
>>>                     example having SSDs for the
>>>                     OS/logs/provenance/flowfile with HDDs for the
>>>                     content - that would be quite nice. At that rate
>>>                     to take full advantage of the system you'd need
>>>                     to have very strong network infrastructure
>>>                     between NiFi and any systems it is interfacing
>>>                     with  and your flows would need to be well tuned
>>>                     for GC/memory efficiency.
>>>
>>>                     Thanks
>>>                     Joe
>>>
>>>                     On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian
>>>                     <alinazemian@gmail.com
>>>                     <ma...@gmail.com>> wrote:
>>>
>>>                         Dear Nifi Users/ developers,
>>>                         Hi,
>>>
>>>                         I was wondering is there any benchmark about
>>>                         the question that is it better to dedicate
>>>                         disk control to Nifi or using RAID for this
>>>                         purpose? For example, which of these
>>>                         scenarios is recommended from the
>>>                         performance point of view?
>>>                         Scenario 1:
>>>                         24 disk in total
>>>                         2 disk- raid 1 for OS and fileflow repo
>>>                         2 disk- raid 1 for provenance repo1
>>>                         2 disk- raid 1 for provenance repo2
>>>                         2 disk- raid 1 for content repo1
>>>                         2 disk- raid 1 for content repo2
>>>                         2 disk- raid 1 for content repo3
>>>                         2 disk- raid 1 for content repo4
>>>                         2 disk- raid 1 for content repo5
>>>                         2 disk- raid 1 for content repo6
>>>                         2 disk- raid 1 for content repo7
>>>                         2 disk- raid 1 for content repo8
>>>                         2 disk- raid 1 for content repo9
>>>
>>>
>>>                         Scenario 2:
>>>                         24 disk in total
>>>                         2 disk- raid 1 for OS and fileflow repo
>>>                         4 disk- raid 10 for provenance repo1
>>>                         18 disk- raid 10 for content repo1
>>>
>>>                         Moreover, is there any benchmark for SSD vs
>>>                         HDD performance for Nifi?
>>>                         Thank you very much.
>>>
>>>                         Best regards,
>>>                         Ali
>>>
>>>
>>>
>>>
>>>
>>>                 -- 
>>>                 A.Nazemian
>>>
>>>
>>>
>>>
>>>
>>>         -- 
>>>         A.Nazemian
>>>
>>>
>>>
>>>
>>>     -- 
>>>     A.Nazemian
>>
>>


Re: Nifi hardware recommendation

Posted by Corey Flowers <cf...@onyxpoint.com>.
We actually use heap sizes from 32 to 64 GB for ours, but our volumes and graphs are both extremely large, although I believe the smaller heap sizes were a limitation of the garbage collection in Java 7. We also moved to SSD drives, which did help throughput quite a bit. Our systems were actually requesting the creation and removal of file handles faster than traditional disks could keep up with (we believe). In addition, unlike with traditional drives, where we tried to minimize caching, we actually forced more disk caching when we moved to SSDs. We are still waiting to see the results of that on our volumes, although it does seem to have helped. Also remember that, depending on how you code them, individual processors can use system memory outside of the heap, so you need to take that into consideration when designing the servers.
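
On the file-handle point, the usual precaution is to raise the open-file limit
for the NiFi user well above the OS default; a sketch (user name and values
are illustrative):

    # check the current limit for the user running NiFi
    ulimit -n
    # raise it, e.g. in /etc/security/limits.conf
    nifi  soft  nofile  50000
    nifi  hard  nofile  50000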

Sent from my iPhone

> On Oct 14, 2016, at 1:36 PM, Joe Witt <jo...@gmail.com> wrote:
> 
> Russ,
> 
> You can definitely find a lot of material on the Internet about Java heap sizes, types of garbage collectors, application usage patterns.  By all means please do experiment with different sizes appropriate for your case.  We're not saying NiFi itself has any problem with large heaps.
> 
> Thanks
> Joe
> 
>> On Fri, Oct 14, 2016 at 12:44 PM, Russell Bateman <ru...@perfectsearchcorp.com> wrote:
>> Ali,
>> 
>> "not recommended to dedicate more than 8-10 GM to JVM heap space" by whom? Do you have links/references establishing this? I couldn't find anyone saying this or why.
>> 
>> Russ
>> 
>>> On 10/13/2016 05:47 PM, Ali Nazemian wrote:
>>> Hi,
>>> 
>>> I have another question regarding the hardware recommendation. As far as I found out, Nifi uses on-heap memory currently, and it will not try to load the whole object in memory. From the garbage collection perspective, it is not recommended to dedicate more than 8-10 GB to JVM heap space. In this case, may I say spending money on system memory is useless? Probably 16 GB per each system is enough according to this architecture. Unless some architecture changes appear in the future to use off-heap memory as well. However, I found some articles about best practices, and in terms of memory recommendation it does not make sense. Would you please clarify this part for me?
>>> Thank you very much.
>>> 
>>> Best regards,
>>> Ali
>>> 
>>> 
>>> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <al...@gmail.com> wrote:
>>>> Thank you very much. 
>>>> I would be more than happy to provide some benchmark results after the implementation. 
>>>> 
>>>> Sincerely yours,
>>>> Ali
>>>> 
>>>>> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <jo...@gmail.com> wrote:
>>>>> Ali,
>>>>> 
>>>>> I agree with your assumption.  It would be great to test that out and provide some numbers but intuitively I agree.
>>>>> 
>>>>> I could envision certain scatter/gather data flows that could challenge that sequential access assumption but honestly with how awesome disk caching is in Linux these days I think practically speaking this is the right way to think about it.
>>>>> 
>>>>> Thanks
>>>>> Joe
>>>>> 
>>>>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <al...@gmail.com> wrote:
>>>>>> Dear Joe,
>>>>>> 
>>>>>> Thank you very much. That was a really great explanation. 
>>>>>> I investigated the Nifi architecture, and it seems that most of the read/write operations for flow file repo and provenance repo are random. However, for content repo most of the read/write operations are sequential. Let's say cost does not matter. In this case, even choosing SSD for content repo can not provide huge performance gain instead of HDD. Am I right? Hence, it would be better to spend content repo SSD money on network infrastructure.
>>>>>> 
>>>>>> Best regards,
>>>>>> Ali
>>>>>> 
>>>>>> 
>>>>>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <jo...@gmail.com> wrote:
>>>>>>> Ali,
>>>>>>> 
>>>>>>> You have a lot of nice resources to work with there.  I'd recommend the series of RAID-1 configuration personally provided you keep in mind this means you can only lose a single disk for any one partition.  As long as they're being monitored and would be quickly replaced this in practice works well.  If there could be lapses in monitoring or time to replace then it is perhaps safer to go with more redundancy or an alternative RAID type.
>>>>>>> 
>>>>>>> I'd say do the OS, app installs w/user and audit db stuff, application logs on one physical RAID volume.  Have a dedicated physical volume for the flow file repository.  It will not be able to use all the space but it certainly could benefit from having no other contention.  This could be a great thing to have SSDs for actually.  And for the remaining volumes split them up for content and provenance as you have.  You get to make the overall performance versus retention decision.  Frankly, you have a great system to work with and I suspect you're going to see excellent results anyway.
>>>>>>> 
>>>>>>> Conservatively speaking expect say 50MB/s of throughput per volume in the content repository so if you end up with 8 of them could achieve upwards of 400MB/s sustained.  You'll also then want to make sure you have a good 10G based network setup as well.  Or, you could dial back on the speed tradeoff and simply increase retention or disk loss tolerance.  Lots of ways to play the game.
>>>>>>> 
>>>>>>> There are no published SSD vs HDD performance benchmarks that I am aware of though this is a good idea.  Having a hybrid of SSDs and HDDs could offer a really solid performance/retention/cost tradeoff.  For example having SSDs for the OS/logs/provenance/flowfile with HDDs for the content - that would be quite nice.  At that rate to take full advantage of the system you'd need to have very strong network infrastructure between NiFi and any systems it is interfacing with  and your flows would need to be well tuned for GC/memory efficiency.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Joe 
>>>>>>> 
>>>>>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <al...@gmail.com> wrote:
>>>>>>>> Dear Nifi Users/ developers,
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I was wondering is there any benchmark about the question that is it better to dedicate disk control to Nifi or using RAID for this purpose? For example, which of these scenarios is recommended from the performance point of view?
>>>>>>>> Scenario 1: 
>>>>>>>> 24 disk in total
>>>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>>>> 2 disk- raid 1 for provenance repo1
>>>>>>>> 2 disk- raid 1 for provenance repo2
>>>>>>>> 2 disk- raid 1 for content repo1
>>>>>>>> 2 disk- raid 1 for content repo2
>>>>>>>> 2 disk- raid 1 for content repo3
>>>>>>>> 2 disk- raid 1 for content repo4
>>>>>>>> 2 disk- raid 1 for content repo5
>>>>>>>> 2 disk- raid 1 for content repo6
>>>>>>>> 2 disk- raid 1 for content repo7
>>>>>>>> 2 disk- raid 1 for content repo8
>>>>>>>> 2 disk- raid 1 for content repo9
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Scenario 2: 
>>>>>>>> 24 disk in total
>>>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>>>> 4 disk- raid 10 for provenance repo1
>>>>>>>> 18 disk- raid 10 for content repo1
>>>>>>>> 
>>>>>>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>>>>>>> Thank you very much.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Ali
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> A.Nazemian
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> A.Nazemian
>>> 
>>> 
>>> 
>>> -- 
>>> A.Nazemian
>> 
> 

Re: Nifi hardware recommendation

Posted by Joe Witt <jo...@gmail.com>.
Russ,

You can definitely find a lot of material on the Internet about Java heap
sizes, types of garbage collectors, application usage patterns.  By all
means please do experiment with different sizes appropriate for your case.
We're not saying NiFi itself has any problem with large heaps.
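
One low-risk way to run such experiments is to turn on GC logging alongside
the heap settings in conf/bootstrap.conf and compare pause times between
runs; a sketch for a Java 8 era JVM (the argument indexes are arbitrary):

    java.arg.20=-verbose:gc
    java.arg.21=-XX:+PrintGCDetails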

Thanks
Joe

On Fri, Oct 14, 2016 at 12:44 PM, Russell Bateman <
russell.bateman@perfectsearchcorp.com> wrote:

> Ali,
>
> "not recommended to dedicate more than 8-10 GM to JVM heap space" by whom?
> Do you have links/references establishing this? I couldn't find anyone
> saying this or why.
>
> Russ
>
> On 10/13/2016 05:47 PM, Ali Nazemian wrote:
>
> Hi,
>
> I have another question regarding the hardware recommendation. As far as I
> found out, Nifi uses on-heap memory currently, and it will not try to load
> the whole object in memory. From the garbage collection perspective, it is
> not recommended to dedicate more than 8-10 GB to JVM heap space. In this
> case, may I say spending money on system memory is useless? Probably 16 GB
> per each system is enough according to this architecture. Unless some
> architecture changes appear in the future to use off-heap memory as well.
> However, I found some articles about best practices, and in terms of memory
> recommendation it does not make sense. Would you please clarify this part
> for me?
> Thank you very much.
>
> Best regards,
> Ali
>
>
> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <al...@gmail.com>
> wrote:
>
>> Thank you very much.
>> I would be more than happy to provide some benchmark results after the
>> implementation.
>> Sincerely yours,
>> Ali
>>
>> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt < <jo...@gmail.com>
>> joe.witt@gmail.com> wrote:
>>
>>> Ali,
>>>
>>> I agree with your assumption.  It would be great to test that out and
>>> provide some numbers but intuitively I agree.
>>>
>>> I could envision certain scatter/gather data flows that could challenge
>>> that sequential access assumption but honestly with how awesome disk
>>> caching is in Linux these days I think practically speaking this is the
>>> right way to think about it.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <al...@gmail.com>
>>> wrote:
>>>
>>>> Dear Joe,
>>>>
>>>> Thank you very much. That was a really great explanation.
>>>> I investigated the Nifi architecture, and it seems that most of the
>>>> read/write operations for flow file repo and provenance repo are random.
>>>> However, for content repo most of the read/write operations are sequential.
>>>> Let's say cost does not matter. In this case, even choosing SSD for content
>>>> repo can not provide huge performance gain instead of HDD. Am I right?
>>>> Hence, it would be better to spend content repo SSD money on network
>>>> infrastructure.
>>>>
>>>> Best regards,
>>>> Ali
>>>>
>>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt < <jo...@gmail.com>
>>>> joe.witt@gmail.com> wrote:
>>>>
>>>>> Ali,
>>>>>
>>>>> You have a lot of nice resources to work with there.  I'd recommend
>>>>> the series of RAID-1 configuration personally provided you keep in mind
>>>>> this means you can only lose a single disk for any one partition.  As long
>>>>> as they're being monitored and would be quickly replaced this in practice
>>>>> works well.  If there could be lapses in monitoring or time to replace then
>>>>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>>>>
>>>>> I'd say do the OS, app installs w/user and audit db stuff, application
>>>>> logs on one physical RAID volume.  Have a dedicated physical volume for the
>>>>> flow file repository.  It will not be able to use all the space but it
>>>>> certainly could benefit from having no other contention.  This could be a
>>>>> great thing to have SSDs for actually.  And for the remaining volumes split
>>>>> them up for content and provenance as you have.  You get to make the
>>>>> overall performance versus retention decision.  Frankly, you have a great
>>>>> system to work with and I suspect you're going to see excellent results
>>>>> anyway.
>>>>>
>>>>> Conservatively speaking expect say 50MB/s of throughput per volume in
>>>>> the content repository so if you end up with 8 of them could achieve
>>>>> upwards of 400MB/s sustained.  You'll also then want to make sure you have
>>>>> a good 10G based network setup as well.  Or, you could dial back on the
>>>>> speed tradeoff and simply increase retention or disk loss tolerance.  Lots
>>>>> of ways to play the game.
>>>>>
>>>>> There are no published SSD vs HDD performance benchmarks that I am
>>>>> aware of though this is a good idea.  Having a hybrid of SSDs and HDDs
>>>>> could offer a really solid performance/retention/cost tradeoff.  For
>>>>> example having SSDs for the OS/logs/provenance/flowfile with HDDs for the
>>>>> content - that would be quite nice.  At that rate to take full advantage of
>>>>> the system you'd need to have very strong network infrastructure between
>>>>> NiFi and any systems it is interfacing with  and your flows would need to
>>>>> be well tuned for GC/memory efficiency.
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <
>>>>> <al...@gmail.com> wrote:
>>>>>
>>>>>> Dear Nifi Users/ developers,
>>>>>> Hi,
>>>>>>
>>>>>> I was wondering is there any benchmark about the question that is it
>>>>>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>>>>>> example, which of these scenarios is recommended from the performance point
>>>>>> of view?
>>>>>> Scenario 1:
>>>>>> 24 disk in total
>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>> 2 disk- raid 1 for provenance repo1
>>>>>> 2 disk- raid 1 for provenance repo2
>>>>>> 2 disk- raid 1 for content repo1
>>>>>> 2 disk- raid 1 for content repo2
>>>>>> 2 disk- raid 1 for content repo3
>>>>>> 2 disk- raid 1 for content repo4
>>>>>> 2 disk- raid 1 for content repo5
>>>>>> 2 disk- raid 1 for content repo6
>>>>>> 2 disk- raid 1 for content repo7
>>>>>> 2 disk- raid 1 for content repo8
>>>>>> 2 disk- raid 1 for content repo9
>>>>>>
>>>>>>
>>>>>> Scenario 2:
>>>>>> 24 disk in total
>>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>>> 4 disk- raid 10 for provenance repo1
>>>>>> 18 disk- raid 10 for content repo1
>>>>>>
>>>>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>>>>> Thank you very much.
>>>>>>
>>>>>> Best regards,
>>>>>> Ali
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> A.Nazemian
>>>>
>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>
>
>
> --
> A.Nazemian
>
>
>

Re: Nifi hardware recommendation

Posted by Russell Bateman <ru...@perfectsearchcorp.com>.
Ali,

"not recommended to dedicate more than 8-10 GM to JVM heap space" by 
whom? Do you have links/references establishing this? I couldn't find 
anyone saying this or why.

Russ

On 10/13/2016 05:47 PM, Ali Nazemian wrote:
> Hi,
>
> I have another question regarding the hardware recommendation. As far 
> as I found out, Nifi uses on-heap memory currently, and it will not 
> try to load the whole object in memory. From the garbage collection 
> perspective, it is not recommended to dedicate more than 8-10 GB to 
> JVM heap space. In this case, may I say spending money on system 
> memory is useless? Probably 16 GB per each system is enough according 
> to this architecture. Unless some architecture changes appear in the 
> future to use off-heap memory as well. However, I found some articles 
> about best practices, and in terms of memory recommendation it does 
> not make sense. Would you please clarify this part for me?
> Thank you very much.
>
> Best regards,
> Ali
>
>
> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <alinazemian@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Thank you very much.
>     I would be more than happy to provide some benchmark results after
>     the implementation.
>     Sincerely yours,
>     Ali
>
>     On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <joe.witt@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         Ali,
>
>         I agree with your assumption.  It would be great to test that
>         out and provide some numbers but intuitively I agree.
>
>         I could envision certain scatter/gather data flows that could
>         challenge that sequential access assumption but honestly with
>         how awesome disk caching is in Linux these days I think
>         practically speaking this is the right way to think about it.
>
>         Thanks
>         Joe
>
>         On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian
>         <alinazemian@gmail.com <ma...@gmail.com>> wrote:
>
>             Dear Joe,
>
>             Thank you very much. That was a really great explanation.
>             I investigated the Nifi architecture, and it seems that
>             most of the read/write operations for flow file repo and
>             provenance repo are random. However, for content repo most
>             of the read/write operations are sequential. Let's say
>             cost does not matter. In this case, even choosing SSD for
>             content repo can not provide huge performance gain instead
>             of HDD. Am I right? Hence, it would be better to spend
>             content repo SSD money on network infrastructure.
>
>             Best regards,
>             Ali
>
>             On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt
>             <joe.witt@gmail.com <ma...@gmail.com>> wrote:
>
>                 Ali,
>
>                 You have a lot of nice resources to work with there. 
>                 I'd recommend the series of RAID-1 configuration
>                 personally provided you keep in mind this means you
>                 can only lose a single disk for any one partition.  As
>                 long as they're being monitored and would be quickly
>                 replaced this in practice works well.  If there could
>                 be lapses in monitoring or time to replace then it is
>                 perhaps safer to go with more redundancy or an
>                 alternative RAID type.
>
>                 I'd say do the OS, app installs w/user and audit db
>                 stuff, application logs on one physical RAID volume. 
>                 Have a dedicated physical volume for the flow file
>                 repository.  It will not be able to use all the space
>                 but it certainly could benefit from having no other
>                 contention.  This could be a great thing to have SSDs
>                 for actually.  And for the remaining volumes split
>                 them up for content and provenance as you have. You
>                 get to make the overall performance versus retention
>                 decision. Frankly, you have a great system to work
>                 with and I suspect you're going to see excellent
>                 results anyway.
>
>                 Conservatively speaking expect say 50MB/s of
>                 throughput per volume in the content repository so if
>                 you end up with 8 of them could achieve upwards of
>                 400MB/s sustained. You'll also then want to make sure
>                 you have a good 10G based network setup as well.  Or,
>                 you could dial back on the speed tradeoff and simply
>                 increase retention or disk loss tolerance.  Lots of
>                 ways to play the game.
>
>                 There are no published SSD vs HDD performance
>                 benchmarks that I am aware of though this is a good
>                 idea.  Having a hybrid of SSDs and HDDs could offer a
>                 really solid performance/retention/cost tradeoff.  For
>                 example having SSDs for the
>                 OS/logs/provenance/flowfile with HDDs for the content
>                 - that would be quite nice.  At that rate to take full
>                 advantage of the system you'd need to have very strong
>                 network infrastructure between NiFi and any systems it
>                 is interfacing with  and your flows would need to be
>                 well tuned for GC/memory efficiency.
>
>                 Thanks
>                 Joe
>
>                 On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian
>                 <alinazemian@gmail.com <ma...@gmail.com>>
>                 wrote:
>
>                     Dear Nifi Users/ developers,
>                     Hi,
>
>                     I was wondering is there any benchmark about the
>                     question that is it better to dedicate disk
>                     control to Nifi or using RAID for this purpose?
>                     For example, which of these scenarios is
>                     recommended from the performance point of view?
>                     Scenario 1:
>                     24 disk in total
>                     2 disk- raid 1 for OS and fileflow repo
>                     2 disk- raid 1 for provenance repo1
>                     2 disk- raid 1 for provenance repo2
>                     2 disk- raid 1 for content repo1
>                     2 disk- raid 1 for content repo2
>                     2 disk- raid 1 for content repo3
>                     2 disk- raid 1 for content repo4
>                     2 disk- raid 1 for content repo5
>                     2 disk- raid 1 for content repo6
>                     2 disk- raid 1 for content repo7
>                     2 disk- raid 1 for content repo8
>                     2 disk- raid 1 for content repo9
>
>
>                     Scenario 2:
>                     24 disk in total
>                     2 disk- raid 1 for OS and fileflow repo
>                     4 disk- raid 10 for provenance repo1
>                     18 disk- raid 10 for content repo1
>
>                     Moreover, is there any benchmark for SSD vs HDD
>                     performance for Nifi?
>                     Thank you very much.
>
>                     Best regards,
>                     Ali
>
>
>
>
>
>             -- 
>             A.Nazemian
>
>
>
>
>
>     -- 
>     A.Nazemian
>
>
>
>
> -- 
> A.Nazemian


Re: Nifi hardware recommendation

Posted by Ali Nazemian <al...@gmail.com>.
Hi,

I have another question regarding the hardware recommendation. As far as I
can tell, NiFi currently uses on-heap memory only, and it does not try to
load the whole content of a FlowFile into memory. From a garbage-collection
perspective, it is not recommended to dedicate more than 8-10 GB to the JVM
heap. In that case, may I say that spending money on extra system memory is
wasted? Probably 16 GB per system is enough for this architecture, unless
the architecture changes in the future to use off-heap memory as well.
However, the best-practice articles I have found do not make sense to me in
terms of their memory recommendations. Would you please clarify this part
for me?
Thank you very much.

Best regards,
Ali
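
As a point of reference for this discussion, the heap in question is set in
NiFi's conf/bootstrap.conf. A minimal sketch of what an 8 GB heap would look
like is below; java.arg.2 and java.arg.3 are the stock heap entries, while
the 8 GB figure is only an illustrative assumption, not a recommendation:

# conf/bootstrap.conf (sketch; values are illustrative)
# Initial and maximum JVM heap size for the NiFi process
java.arg.2=-Xms8g
java.arg.3=-Xmx8g

Changing those two entries and restarting NiFi is all that heap sizing
involves; how large a heap actually helps depends on the flow.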


On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <al...@gmail.com>
wrote:

> Thank you very much.
> I would be more than happy to provide some benchmark results after the
> implementation.
> Sincerely yours,
> Ali
>
> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <jo...@gmail.com> wrote:
>
>> Ali,
>>
>> I agree with your assumption.  It would be great to test that out and
>> provide some numbers but intuitively I agree.
>>
>> I could envision certain scatter/gather data flows that could challenge
>> that sequential access assumption but honestly with how awesome disk
>> caching is in Linux these days I think practically speaking this is the
>> right way to think about it.
>>
>> Thanks
>> Joe
>>
>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <al...@gmail.com>
>> wrote:
>>
>>> Dear Joe,
>>>
>>> Thank you very much. That was a really great explanation.
>>> I investigated the Nifi architecture, and it seems that most of the
>>> read/write operations for flow file repo and provenance repo are random.
>>> However, for content repo most of the read/write operations are sequential.
>>> Let's say cost does not matter. In this case, even choosing SSD for content
>>> repo can not provide huge performance gain instead of HDD. Am I right?
>>> Hence, it would be better to spend content repo SSD money on network
>>> infrastructure.
>>>
>>> Best regards,
>>> Ali
>>>
>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <jo...@gmail.com> wrote:
>>>
>>>> Ali,
>>>>
>>>> You have a lot of nice resources to work with there.  I'd recommend the
>>>> series of RAID-1 configuration personally provided you keep in mind this
>>>> means you can only lose a single disk for any one partition.  As long as
>>>> they're being monitored and would be quickly replaced this in practice
>>>> works well.  If there could be lapses in monitoring or time to replace then
>>>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>>>
>>>> I'd say do the OS, app installs w/user and audit db stuff, application
>>>> logs on one physical RAID volume.  Have a dedicated physical volume for the
>>>> flow file repository.  It will not be able to use all the space but it
>>>> certainly could benefit from having no other contention.  This could be a
>>>> great thing to have SSDs for actually.  And for the remaining volumes split
>>>> them up for content and provenance as you have.  You get to make the
>>>> overall performance versus retention decision.  Frankly, you have a great
>>>> system to work with and I suspect you're going to see excellent results
>>>> anyway.
>>>>
>>>> Conservatively speaking expect say 50MB/s of throughput per volume in
>>>> the content repository so if you end up with 8 of them could achieve
>>>> upwards of 400MB/s sustained.  You'll also then want to make sure you have
>>>> a good 10G based network setup as well.  Or, you could dial back on the
>>>> speed tradeoff and simply increase retention or disk loss tolerance.  Lots
>>>> of ways to play the game.
>>>>
>>>> There are no published SSD vs HDD performance benchmarks that I am
>>>> aware of though this is a good idea.  Having a hybrid of SSDs and HDDs
>>>> could offer a really solid performance/retention/cost tradeoff.  For
>>>> example having SSDs for the OS/logs/provenance/flowfile with HDDs for the
>>>> content - that would be quite nice.  At that rate to take full advantage of
>>>> the system you'd need to have very strong network infrastructure between
>>>> NiFi and any systems it is interfacing with  and your flows would need to
>>>> be well tuned for GC/memory efficiency.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <al...@gmail.com>
>>>> wrote:
>>>>
>>>>> Dear Nifi Users/ developers,
>>>>> Hi,
>>>>>
>>>>> I was wondering is there any benchmark about the question that is it
>>>>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>>>>> example, which of these scenarios is recommended from the performance point
>>>>> of view?
>>>>> Scenario 1:
>>>>> 24 disk in total
>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>> 2 disk- raid 1 for provenance repo1
>>>>> 2 disk- raid 1 for provenance repo2
>>>>> 2 disk- raid 1 for content repo1
>>>>> 2 disk- raid 1 for content repo2
>>>>> 2 disk- raid 1 for content repo3
>>>>> 2 disk- raid 1 for content repo4
>>>>> 2 disk- raid 1 for content repo5
>>>>> 2 disk- raid 1 for content repo6
>>>>> 2 disk- raid 1 for content repo7
>>>>> 2 disk- raid 1 for content repo8
>>>>> 2 disk- raid 1 for content repo9
>>>>>
>>>>>
>>>>> Scenario 2:
>>>>> 24 disk in total
>>>>> 2 disk- raid 1 for OS and fileflow repo
>>>>> 4 disk- raid 10 for provenance repo1
>>>>> 18 disk- raid 10 for content repo1
>>>>>
>>>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>>>> Thank you very much.
>>>>>
>>>>> Best regards,
>>>>> Ali
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> A.Nazemian
>>>
>>
>>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian

Re: Nifi hardware recommendation

Posted by Ali Nazemian <al...@gmail.com>.
Thank you very much.
I would be more than happy to provide some benchmark results after the
implementation.
Sincerely yours,
Ali

On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <jo...@gmail.com> wrote:

> Ali,
>
> I agree with your assumption.  It would be great to test that out and
> provide some numbers but intuitively I agree.
>
> I could envision certain scatter/gather data flows that could challenge
> that sequential access assumption but honestly with how awesome disk
> caching is in Linux these days I think practically speaking this is the
> right way to think about it.
>
> Thanks
> Joe
>
> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <al...@gmail.com>
> wrote:
>
>> Dear Joe,
>>
>> Thank you very much. That was a really great explanation.
>> I investigated the Nifi architecture, and it seems that most of the
>> read/write operations for flow file repo and provenance repo are random.
>> However, for content repo most of the read/write operations are sequential.
>> Let's say cost does not matter. In this case, even choosing SSD for content
>> repo can not provide huge performance gain instead of HDD. Am I right?
>> Hence, it would be better to spend content repo SSD money on network
>> infrastructure.
>>
>> Best regards,
>> Ali
>>
>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <jo...@gmail.com> wrote:
>>
>>> Ali,
>>>
>>> You have a lot of nice resources to work with there.  I'd recommend the
>>> series of RAID-1 configuration personally provided you keep in mind this
>>> means you can only lose a single disk for any one partition.  As long as
>>> they're being monitored and would be quickly replaced this in practice
>>> works well.  If there could be lapses in monitoring or time to replace then
>>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>>
>>> I'd say do the OS, app installs w/user and audit db stuff, application
>>> logs on one physical RAID volume.  Have a dedicated physical volume for the
>>> flow file repository.  It will not be able to use all the space but it
>>> certainly could benefit from having no other contention.  This could be a
>>> great thing to have SSDs for actually.  And for the remaining volumes split
>>> them up for content and provenance as you have.  You get to make the
>>> overall performance versus retention decision.  Frankly, you have a great
>>> system to work with and I suspect you're going to see excellent results
>>> anyway.
>>>
>>> Conservatively speaking expect say 50MB/s of throughput per volume in
>>> the content repository so if you end up with 8 of them could achieve
>>> upwards of 400MB/s sustained.  You'll also then want to make sure you have
>>> a good 10G based network setup as well.  Or, you could dial back on the
>>> speed tradeoff and simply increase retention or disk loss tolerance.  Lots
>>> of ways to play the game.
>>>
>>> There are no published SSD vs HDD performance benchmarks that I am aware
>>> of though this is a good idea.  Having a hybrid of SSDs and HDDs could
>>> offer a really solid performance/retention/cost tradeoff.  For example
>>> having SSDs for the OS/logs/provenance/flowfile with HDDs for the content -
>>> that would be quite nice.  At that rate to take full advantage of the
>>> system you'd need to have very strong network infrastructure between NiFi
>>> and any systems it is interfacing with  and your flows would need to be
>>> well tuned for GC/memory efficiency.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <al...@gmail.com>
>>> wrote:
>>>
>>>> Dear Nifi Users/ developers,
>>>> Hi,
>>>>
>>>> I was wondering is there any benchmark about the question that is it
>>>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>>>> example, which of these scenarios is recommended from the performance point
>>>> of view?
>>>> Scenario 1:
>>>> 24 disk in total
>>>> 2 disk- raid 1 for OS and fileflow repo
>>>> 2 disk- raid 1 for provenance repo1
>>>> 2 disk- raid 1 for provenance repo2
>>>> 2 disk- raid 1 for content repo1
>>>> 2 disk- raid 1 for content repo2
>>>> 2 disk- raid 1 for content repo3
>>>> 2 disk- raid 1 for content repo4
>>>> 2 disk- raid 1 for content repo5
>>>> 2 disk- raid 1 for content repo6
>>>> 2 disk- raid 1 for content repo7
>>>> 2 disk- raid 1 for content repo8
>>>> 2 disk- raid 1 for content repo9
>>>>
>>>>
>>>> Scenario 2:
>>>> 24 disk in total
>>>> 2 disk- raid 1 for OS and fileflow repo
>>>> 4 disk- raid 10 for provenance repo1
>>>> 18 disk- raid 10 for content repo1
>>>>
>>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>>> Thank you very much.
>>>>
>>>> Best regards,
>>>> Ali
>>>>
>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>
>


-- 
A.Nazemian

Re: Nifi hardware recommendation

Posted by Joe Witt <jo...@gmail.com>.
Ali,

I agree with your assumption.  It would be great to test that out and
provide some numbers but intuitively I agree.

I could envision certain scatter/gather data flows that could challenge
that sequential access assumption, but honestly, with how awesome disk
caching is in Linux these days, I think that practically speaking this is
the right way to think about it.

Thanks
Joe

On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <al...@gmail.com> wrote:

> Dear Joe,
>
> Thank you very much. That was a really great explanation.
> I investigated the Nifi architecture, and it seems that most of the
> read/write operations for flow file repo and provenance repo are random.
> However, for content repo most of the read/write operations are sequential.
> Let's say cost does not matter. In this case, even choosing SSD for content
> repo can not provide huge performance gain instead of HDD. Am I right?
> Hence, it would be better to spend content repo SSD money on network
> infrastructure.
>
> Best regards,
> Ali
>
> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <jo...@gmail.com> wrote:
>
>> Ali,
>>
>> You have a lot of nice resources to work with there.  I'd recommend the
>> series of RAID-1 configuration personally provided you keep in mind this
>> means you can only lose a single disk for any one partition.  As long as
>> they're being monitored and would be quickly replaced this in practice
>> works well.  If there could be lapses in monitoring or time to replace then
>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>
>> I'd say do the OS, app installs w/user and audit db stuff, application
>> logs on one physical RAID volume.  Have a dedicated physical volume for the
>> flow file repository.  It will not be able to use all the space but it
>> certainly could benefit from having no other contention.  This could be a
>> great thing to have SSDs for actually.  And for the remaining volumes split
>> them up for content and provenance as you have.  You get to make the
>> overall performance versus retention decision.  Frankly, you have a great
>> system to work with and I suspect you're going to see excellent results
>> anyway.
>>
>> Conservatively speaking expect say 50MB/s of throughput per volume in the
>> content repository so if you end up with 8 of them could achieve upwards of
>> 400MB/s sustained.  You'll also then want to make sure you have a good 10G
>> based network setup as well.  Or, you could dial back on the speed tradeoff
>> and simply increase retention or disk loss tolerance.  Lots of ways to play
>> the game.
>>
>> There are no published SSD vs HDD performance benchmarks that I am aware
>> of though this is a good idea.  Having a hybrid of SSDs and HDDs could
>> offer a really solid performance/retention/cost tradeoff.  For example
>> having SSDs for the OS/logs/provenance/flowfile with HDDs for the content -
>> that would be quite nice.  At that rate to take full advantage of the
>> system you'd need to have very strong network infrastructure between NiFi
>> and any systems it is interfacing with  and your flows would need to be
>> well tuned for GC/memory efficiency.
>>
>> Thanks
>> Joe
>>
>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <al...@gmail.com>
>> wrote:
>>
>>> Dear Nifi Users/ developers,
>>> Hi,
>>>
>>> I was wondering is there any benchmark about the question that is it
>>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>>> example, which of these scenarios is recommended from the performance point
>>> of view?
>>> Scenario 1:
>>> 24 disk in total
>>> 2 disk- raid 1 for OS and fileflow repo
>>> 2 disk- raid 1 for provenance repo1
>>> 2 disk- raid 1 for provenance repo2
>>> 2 disk- raid 1 for content repo1
>>> 2 disk- raid 1 for content repo2
>>> 2 disk- raid 1 for content repo3
>>> 2 disk- raid 1 for content repo4
>>> 2 disk- raid 1 for content repo5
>>> 2 disk- raid 1 for content repo6
>>> 2 disk- raid 1 for content repo7
>>> 2 disk- raid 1 for content repo8
>>> 2 disk- raid 1 for content repo9
>>>
>>>
>>> Scenario 2:
>>> 24 disk in total
>>> 2 disk- raid 1 for OS and fileflow repo
>>> 4 disk- raid 10 for provenance repo1
>>> 18 disk- raid 10 for content repo1
>>>
>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>> Thank you very much.
>>>
>>> Best regards,
>>> Ali
>>>
>>
>>
>
>
> --
> A.Nazemian
>

Re: Nifi hardware recommendation

Posted by Ali Nazemian <al...@gmail.com>.
Dear Joe,

Thank you very much. That was a really great explanation.
I looked into the NiFi architecture, and it seems that most of the
read/write operations for the flowfile repo and the provenance repo are
random, while for the content repo most of the read/write operations are
sequential. Even if cost does not matter, choosing SSD for the content repo
would therefore not provide a huge performance gain over HDD. Am I right?
If so, it would be better to spend the content-repo SSD budget on network
infrastructure instead.

Best regards,
Ali
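
To make that placement concrete, each repository is pointed at a directory
in nifi.properties, so an SSD-for-metadata, HDD-for-content split could be
sketched roughly as below. The property names are the standard repository
settings; the mount paths are illustrative assumptions only:

# nifi.properties (sketch; paths are illustrative)
# flowfile and provenance repos: mostly random I/O, so SSD-backed mounts
nifi.flowfile.repository.directory=/ssd1/flowfile_repository
nifi.provenance.repository.directory.default=/ssd2/provenance_repository
# content repo: mostly sequential I/O, so an HDD-backed mount is likely fine
nifi.content.repository.directory.default=/hdd1/content_repository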

On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <jo...@gmail.com> wrote:

> Ali,
>
> You have a lot of nice resources to work with there.  I'd recommend the
> series of RAID-1 configuration personally provided you keep in mind this
> means you can only lose a single disk for any one partition.  As long as
> they're being monitored and would be quickly replaced this in practice
> works well.  If there could be lapses in monitoring or time to replace then
> it is perhaps safer to go with more redundancy or an alternative RAID type.
>
> I'd say do the OS, app installs w/user and audit db stuff, application
> logs on one physical RAID volume.  Have a dedicated physical volume for the
> flow file repository.  It will not be able to use all the space but it
> certainly could benefit from having no other contention.  This could be a
> great thing to have SSDs for actually.  And for the remaining volumes split
> them up for content and provenance as you have.  You get to make the
> overall performance versus retention decision.  Frankly, you have a great
> system to work with and I suspect you're going to see excellent results
> anyway.
>
> Conservatively speaking expect say 50MB/s of throughput per volume in the
> content repository so if you end up with 8 of them could achieve upwards of
> 400MB/s sustained.  You'll also then want to make sure you have a good 10G
> based network setup as well.  Or, you could dial back on the speed tradeoff
> and simply increase retention or disk loss tolerance.  Lots of ways to play
> the game.
>
> There are no published SSD vs HDD performance benchmarks that I am aware
> of though this is a good idea.  Having a hybrid of SSDs and HDDs could
> offer a really solid performance/retention/cost tradeoff.  For example
> having SSDs for the OS/logs/provenance/flowfile with HDDs for the content -
> that would be quite nice.  At that rate to take full advantage of the
> system you'd need to have very strong network infrastructure between NiFi
> and any systems it is interfacing with  and your flows would need to be
> well tuned for GC/memory efficiency.
>
> Thanks
> Joe
>
> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <al...@gmail.com>
> wrote:
>
>> Dear Nifi Users/ developers,
>> Hi,
>>
>> I was wondering is there any benchmark about the question that is it
>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>> example, which of these scenarios is recommended from the performance point
>> of view?
>> Scenario 1:
>> 24 disk in total
>> 2 disk- raid 1 for OS and fileflow repo
>> 2 disk- raid 1 for provenance repo1
>> 2 disk- raid 1 for provenance repo2
>> 2 disk- raid 1 for content repo1
>> 2 disk- raid 1 for content repo2
>> 2 disk- raid 1 for content repo3
>> 2 disk- raid 1 for content repo4
>> 2 disk- raid 1 for content repo5
>> 2 disk- raid 1 for content repo6
>> 2 disk- raid 1 for content repo7
>> 2 disk- raid 1 for content repo8
>> 2 disk- raid 1 for content repo9
>>
>>
>> Scenario 2:
>> 24 disk in total
>> 2 disk- raid 1 for OS and fileflow repo
>> 4 disk- raid 10 for provenance repo1
>> 18 disk- raid 10 for content repo1
>>
>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>> Thank you very much.
>>
>> Best regards,
>> Ali
>>
>
>


-- 
A.Nazemian

Re: Nifi hardware recommendation

Posted by Joe Witt <jo...@gmail.com>.
Ali,

You have a lot of nice resources to work with there.  I'd personally
recommend the series-of-RAID-1 configuration, provided you keep in mind
that it means you can only lose a single disk in any one mirrored pair.
As long as the disks are being monitored and would be quickly replaced,
this works well in practice.  If there could be lapses in monitoring or
delays in replacement, then it is perhaps safer to go with more redundancy
or an alternative RAID type.

I'd say put the OS, the application install (with the user and audit DB
stuff), and the application logs on one physical RAID volume.  Have a
dedicated physical volume for the flowfile repository.  It will not be able
to use all of that space, but it certainly could benefit from having no
other contention; this could actually be a great place for SSDs.  Split the
remaining volumes between content and provenance as you have.  You get to
make the overall performance-versus-retention decision.  Frankly, you have
a great system to work with and I suspect you're going to see excellent
results either way.

Conservatively speaking, expect around 50 MB/s of throughput per volume in
the content repository, so if you end up with 8 of them you could achieve
upwards of 400 MB/s sustained.  You'll then also want to make sure you have
a good 10G-based network setup.  Or, you could dial back on the speed side
of the tradeoff and simply increase retention or disk-loss tolerance.
There are lots of ways to play the game.

There are no published SSD vs. HDD performance benchmarks that I am aware
of, though producing some is a good idea.  A hybrid of SSDs and HDDs could
offer a really solid performance/retention/cost tradeoff: for example, SSDs
for the OS/logs/provenance/flowfile repos with HDDs for the content repo
would be quite nice.  At that rate, to take full advantage of the system
you'd need very strong network infrastructure between NiFi and any systems
it is interfacing with, and your flows would need to be well tuned for
GC/memory efficiency.

Thanks
Joe
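
As a rough sketch of how that volume split maps onto configuration,
nifi.properties accepts one repository directory property per mount point,
with distinct suffixes distinguishing multiple content or provenance
directories. The suffixes and paths below are illustrative assumptions; the
number of entries would follow however many RAID-1 volumes are dedicated to
each repository:

# nifi.properties (sketch; suffixes and mount paths are illustrative)
nifi.flowfile.repository.directory=/flowfile1/flowfile_repository
nifi.provenance.repository.directory.prov1=/prov1/provenance_repository
nifi.provenance.repository.directory.prov2=/prov2/provenance_repository
nifi.content.repository.directory.cont1=/cont1/content_repository
nifi.content.repository.directory.cont2=/cont2/content_repository
nifi.content.repository.directory.cont3=/cont3/content_repository
# ...one entry per additional content volume

NiFi should spread data across all of the directories listed for a given
repository, which is what turns the per-volume 50 MB/s estimate into the
aggregate figure discussed above.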

On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <al...@gmail.com> wrote:

> Dear Nifi Users/ developers,
> Hi,
>
> I was wondering is there any benchmark about the question that is it
> better to dedicate disk control to Nifi or using RAID for this purpose? For
> example, which of these scenarios is recommended from the performance point
> of view?
> Scenario 1:
> 24 disk in total
> 2 disk- raid 1 for OS and fileflow repo
> 2 disk- raid 1 for provenance repo1
> 2 disk- raid 1 for provenance repo2
> 2 disk- raid 1 for content repo1
> 2 disk- raid 1 for content repo2
> 2 disk- raid 1 for content repo3
> 2 disk- raid 1 for content repo4
> 2 disk- raid 1 for content repo5
> 2 disk- raid 1 for content repo6
> 2 disk- raid 1 for content repo7
> 2 disk- raid 1 for content repo8
> 2 disk- raid 1 for content repo9
>
>
> Scenario 2:
> 24 disk in total
> 2 disk- raid 1 for OS and fileflow repo
> 4 disk- raid 10 for provenance repo1
> 18 disk- raid 10 for content repo1
>
> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
> Thank you very much.
>
> Best regards,
> Ali
>