Posted to users@jackrabbit.apache.org by Roy Teeuwen <ro...@teeuwen.be> on 2018/03/04 14:22:00 UTC

Finding out the diskspace used of specific nodes

Hey guys,

I am using Oak 1.6.6 with an authoring system and a few publish systems. We are running the latest TarMK available on the 1.6.6 branch, with a separate file datastore instead of embedding the binaries in the segment store.

What I have noticed so far is that the author's segment store is 16GB with a 165GB datastore, while the publish instances sit at 1.5GB with only a 50GB datastore. I would like to investigate where the big difference between the two systems comes from, seeing as nearly all content nodes are published. Offline compaction runs daily, so that can't be the problem, and online compaction is enabled as well. Are there any tools or methods available to list the disk usage of every node, both in the segment store and in the related datastore files? I can make wild guesses, for example that it is Sling event/job nodes and the like, but I would like some real numbers.

Thanks!
Roy

Re: Finding out the diskspace used of specific nodes

Posted by Roy Teeuwen <ro...@teeuwen.be>.
Hey Michael,

Cool, thanks!

I haven't used the internals of Oak itself yet, so I will have to do some initial knowledge gathering on how to get the record id of a specific JCR node, but I guess that won't be the hardest part.
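
For reference, a minimal sketch of what that lookup could look like against a TarMK store, assuming the Oak 1.6 segment-tar API (the store location and the example path are placeholders; the essential part is that on TarMK every node state is a SegmentNodeState that carries its record id):

    import java.io.File;

    import org.apache.jackrabbit.oak.segment.RecordId;
    import org.apache.jackrabbit.oak.segment.SegmentNodeState;
    import org.apache.jackrabbit.oak.segment.file.FileStore;
    import org.apache.jackrabbit.oak.segment.file.FileStoreBuilder;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    public class RecordIdLookup {
        public static void main(String[] args) throws Exception {
            // Open the segment store directly; the repository must be
            // stopped, since a running instance holds a lock on it.
            FileStore store = FileStoreBuilder
                    .fileStoreBuilder(new File("/path/to/segmentstore"))
                    .build();
            try {
                // Walk from the head revision down to the node of
                // interest ("/content/my-site" is just an example).
                NodeState node = store.getHead();
                for (String name : "content/my-site".split("/")) {
                    node = node.getChildNode(name);
                }
                // Every TarMK node state is a SegmentNodeState and
                // knows the record id it was read from.
                RecordId id = ((SegmentNodeState) node).getRecordId();
                System.out.println("Record id: " + id);
            } finally {
                store.close();
            }
        }
    }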

Thanks,
Roy

> On 7 Mar 2018, at 15:47, Michael Dürig <md...@apache.org> wrote:


Re: Finding out the diskspace used of specific nodes

Posted by Michael Dürig <md...@apache.org>.
Hi,

I just came across the
org.apache.jackrabbit.oak.segment.RecordUsageAnalyser class in Oak,
which I had completely forgotten about. I think you can use it to
parse nodes and have it list some statistics about them. Alternatively,
you should be able to come up with your own tooling relatively easily,
based on org.apache.jackrabbit.oak.segment.SegmentParser (which is also
the base for RecordUsageAnalyser). Please take care though: these tools
are not deeply tested, and any results obtained from them should be
treated with scrutiny.
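
A rough sketch of how that might be wired up, under the assumption
that RecordUsageAnalyser takes the store's SegmentReader and exposes
an analyseNode(RecordId) entry point as in the 1.6 sources (verify
the exact signatures against the actual code before relying on this):

    import java.io.File;

    import org.apache.jackrabbit.oak.segment.RecordUsageAnalyser;
    import org.apache.jackrabbit.oak.segment.SegmentNodeState;
    import org.apache.jackrabbit.oak.segment.file.FileStore;
    import org.apache.jackrabbit.oak.segment.file.FileStoreBuilder;

    public class NodeUsage {
        public static void main(String[] args) throws Exception {
            FileStore store = FileStoreBuilder
                    .fileStoreBuilder(new File("/path/to/segmentstore"))
                    .build();
            try {
                // Analyse all records reachable from the head state;
                // pass the record id of a sub-tree instead to narrow
                // the report down to specific content.
                SegmentNodeState head = store.getHead();
                RecordUsageAnalyser analyser =
                        new RecordUsageAnalyser(store.getReader());
                analyser.analyseNode(head.getRecordId());
                // toString() renders the collected per-record-type
                // size statistics.
                System.out.println(analyser);
            } finally {
                store.close();
            }
        }
    }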

Michael

On 04.03.18 15:22, Roy Teeuwen wrote:

Re: Finding out the diskspace used of specific nodes

Posted by Michael Dürig <mi...@gmail.com>.
Hi,

I think you could do the same via a Groovy script. Depending on how
deep you want to dig into the lower layers, you would need to hack
your way through though. The tooling I started building aims to
simplify this (but hasn't fully succeeded at that yet).
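
For reference, oak-run ships a Groovy console that can be pointed at
a segment store, roughly like this (version and paths are
placeholders):

    # Start the console against a (stopped) repository; it opens the
    # store read-only by default.
    java -jar oak-run-1.6.6.jar console /path/to/segmentstore

    # Inside the console the usual groovysh commands work, e.g.
    # loading a script:
    :load /path/to/analyse.groovy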

Michael

On 6 March 2018 at 22:06, Roy Teeuwen <ro...@teeuwen.be> wrote:

Re: Finding out the diskspace used of specific nodes

Posted by Roy Teeuwen <ro...@teeuwen.be>.
Hey Michael,

Thanks for the info! I will have a look at whether I can still run the script against Oak 1.6.6, who knows :).
Can you tell me what the difference would be between writing a Groovy script and running it in oak-run? Are there things you can't do there that you can in the Scala Ammonite shell you use?

Greets,
Roy

> On 5 Mar 2018, at 13:59, Michael Dürig <mi...@gmail.com> wrote:


Re: Finding out the diskspace used of specific nodes

Posted by Michael Dürig <mi...@gmail.com>.
Hi,

Unfortunately there is no good tooling at this point in time.

In the past I hacked something together which might serve as a
starting point: https://github.com/mduerig/script-oak. This tooling
allows you to fire arbitrary queries at the segment store from the
Ammonite shell (a Scala REPL). Since it relies on a lot of
implementation details that keep changing, the tooling is usually out
of sync with Oak. There are plans to improve this (see
https://issues.apache.org/jira/browse/OAK-6584), but so far there has
not been much commitment to making it happen. Patches welcome though!

Michael

On 4 March 2018 at 15:22, Roy Teeuwen <ro...@teeuwen.be> wrote: