Posted to user@flink.apache.org by Jacob Sevart <js...@uber.com> on 2020/03/04 00:24:30 UTC

Very large _metadata file

Per the documentation:

"The meta data file of a Savepoint contains (primarily) pointers to all
files on stable storage that are part of the Savepoint, in form of absolute
paths."

I somehow have a _metadata file that's 1.9GB. Running *strings* on it I
find 962 strings, most of which look like HDFS paths, which leaves a lot of
that file size unexplained. What else is in there, and how exactly could
this be happening?

We're running 1.6.

Jacob
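
(A rough Java equivalent of that strings pass, for anyone reproducing the
inspection without the Unix tool -- a self-contained sketch that prints
printable-ASCII runs of at least 4 bytes, the strings default minimum:)

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class MetadataStrings {
        public static void main(String[] args) throws IOException {
            final int minLen = 4; // minimum run length, same as the strings default
            StringBuilder run = new StringBuilder();
            long count = 0;
            try (InputStream in = new BufferedInputStream(
                    Files.newInputStream(Paths.get(args[0])))) {
                int b;
                while ((b = in.read()) != -1) {
                    if (b >= 0x20 && b < 0x7f) { // printable ASCII
                        run.append((char) b);
                        continue;
                    }
                    if (run.length() >= minLen) { // emit a completed run
                        System.out.println(run);
                        count++;
                    }
                    run.setLength(0);
                }
                if (run.length() >= minLen) { // flush a trailing run
                    System.out.println(run);
                    count++;
                }
            }
            System.err.println(count + " strings found");
        }
    }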

Re: Very large _metadata file

Posted by Jacob Sevart <js...@uber.com>.
https://github.com/apache/flink/pull/11475

Re: Very large _metadata file

Posted by Jacob Sevart <js...@uber.com>.
Thanks, will do.

I only want the timestamp to reset when the job comes up with no state.
Checkpoint recoveries should keep the same value.

Jacob

Re: Very large _metadata file

Posted by Till Rohrmann <tr...@apache.org>.
Hi Jacob,

If you could create a patch updating the union state metadata
documentation, that would be great. I can help with reviewing and merging
it.

If the value stays fixed over the lifetime of the job and you know it
before starting the job, then you could use the config mechanism. What
won't work is needing a different value for every restart: updating the
config after a recovery is not possible.
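
A minimal sketch of what I mean, assuming the value is captured at
submission time and distributed via ParameterTool/GlobalJobParameters (the
key name here is illustrative):

    import java.time.Instant;
    import java.util.Collections;

    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StartupTimeJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Captured once at submission: the value survives checkpoint
            // recoveries and resets only when the job is submitted fresh.
            ParameterTool params = ParameterTool.fromMap(Collections.singletonMap(
                    "startup-time", Instant.now().toString()));
            env.getConfig().setGlobalJobParameters(params);

            // ... build the topology. Any RichFunction can read it back:
            // ParameterTool p = (ParameterTool) getRuntimeContext()
            //         .getExecutionConfig().getGlobalJobParameters();
            // Instant startup = Instant.parse(p.get("startup-time"));

            env.execute("startup-time-example");
        }
    }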

Cheers,
Till

Re: Very large _metadata file

Posted by Jacob Sevart <js...@uber.com>.
Thanks, makes sense.

What about using the config mechanism? We're collecting and distributing
some environment variables at startup; would it also work to include a
timestamp with that?

Also, would you be interested in a patch to note the caveat about union
state metadata in the documentation?

Jacob

Re: Very large _metadata file

Posted by Till Rohrmann <tr...@apache.org>.
Did I understand you correctly that you use the union state to synchronize
the per-partition state across all operators in order to obtain a global
overview? If this is the case, then this will only work in case of a
failover. Only then are all operators restarted with the union of all
operators' state. If the job never failed, there would never be an
exchange of state.

If you really need a global view over your data, then you need to create an
operator with a parallelism of 1 which records all the different
timestamps.
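
A minimal sketch of that approach (the input stream and names are
illustrative, and a real job would also checkpoint the tracked value):

    import java.time.Instant;

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class OldestStateTimestamp {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for the per-partition startup timestamps.
            DataStream<Instant> startupTimes = env.fromElements(
                    Instant.parse("2020-03-01T00:00:00Z"),
                    Instant.parse("2020-03-04T00:00:00Z"));

            startupTimes
                    .global() // route every element to a single subtask
                    .map(new RichMapFunction<Instant, Instant>() {
                        private Instant oldest; // keep in checkpointed state in a real job
                        @Override
                        public Instant map(Instant ts) {
                            if (oldest == null || ts.isBefore(oldest)) {
                                oldest = ts;
                            }
                            return oldest; // current global minimum
                        }
                    })
                    .setParallelism(1)
                    .print();

            env.execute("oldest-state-timestamp");
        }
    }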

Another idea could be to use the broadcast state pattern [1]. You could
have an operator which extracts the java.time.Instant values and emits them
on a side output while simply forwarding the records on the main output.
Then you could use the side output as the broadcast input and the main
output as the normal input into the broadcast operator. The problem with
this approach might be that you don't get ordering guarantees between the
side and the main output. A compressed sketch of that wiring follows the
link below.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html
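
(Stream contents and the broadcast-state key in this sketch are
illustrative:)

    import java.time.Instant;

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    public class BroadcastTimesExample {

        // Broadcast state: one well-known key holding the oldest timestamp seen.
        static final MapStateDescriptor<String, Instant> DESC =
                new MapStateDescriptor<>("startup-times", String.class, Instant.class);

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            final OutputTag<Instant> timesTag = new OutputTag<Instant>("times") {};

            // Stand-in for the real input; emit a timestamp on the side
            // output and forward the record unchanged on the main output.
            SingleOutputStreamOperator<String> main = env
                    .fromElements("a", "b", "c")
                    .process(new ProcessFunction<String, String>() {
                        @Override
                        public void processElement(String value, Context ctx,
                                Collector<String> out) {
                            ctx.output(timesTag, Instant.now());
                            out.collect(value);
                        }
                    });

            // The side output becomes the broadcast input; the main output
            // stays the normal input of the broadcast operator.
            BroadcastStream<Instant> times =
                    main.getSideOutput(timesTag).broadcast(DESC);

            main.connect(times)
                    .process(new BroadcastProcessFunction<String, Instant, String>() {
                        @Override
                        public void processElement(String value, ReadOnlyContext ctx,
                                Collector<String> out) throws Exception {
                            Instant oldest = ctx.getBroadcastState(DESC).get("oldest");
                            out.collect(value + " (state since " + oldest + ")");
                        }

                        @Override
                        public void processBroadcastElement(Instant ts, Context ctx,
                                Collector<String> out) throws Exception {
                            Instant oldest = ctx.getBroadcastState(DESC).get("oldest");
                            if (oldest == null || ts.isBefore(oldest)) {
                                ctx.getBroadcastState(DESC).put("oldest", ts);
                            }
                        }
                    })
                    .print();

            env.execute("broadcast-times-example");
        }
    }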

Cheers,
Till

On Tue, Mar 17, 2020 at 2:29 AM Jacob Sevart <js...@uber.com> wrote:

> Thanks! That would do it. I've disabled the operator for now.
>
> The purpose was to know the age of the job's state, so that we could
> consider its output in terms of how much context it knows. Regular state
> seemed insufficient because partitions might see their first traffic at
> different times.
>
> How would you go about implementing something like that?
>
> On Mon, Mar 16, 2020 at 1:54 PM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Jacob,
>>
>> I think you are running into some deficiencies of Flink's union state
>> here. The problem is that for every entry in your list state, Flink stores
>> a separate offset (a long value). The reason for this behaviour is that we
>> use the same state implementation for the union state as well as for the
>> split state. For the latter, the offset information is required to split
>> the state in case of changing the parallelism of your job.
>>
>> My recommendation would be to try to get rid of union state altogether.
>> The union state has primarily been introduced to checkpoint some source
>> implementations and might become deprecated due to performance problems
>> once these sources can be checkpointed differently.
>>
>> Cheers,
>> Till
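
(For reference, the kind of declaration Till describes: a hypothetical
sketch of union list state in a CheckpointedFunction. The per-entry offsets
also explain the numbers below: 5.3 million entries x 8 bytes per long is
about 42.4MB of offsets per subtask, and 48 subtasks of that is roughly 2GB
of metadata.)

    import java.time.Instant;

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

    public class StartupTimesState implements CheckpointedFunction {

        private transient ListState<Instant> startupTimes;

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            // Union list state: on restore every subtask receives the full
            // list, and the checkpoint metadata tracks one long offset per
            // list element -- the source of the blow-up described above.
            startupTimes = context.getOperatorStateStore().getUnionListState(
                    new ListStateDescriptor<>("startup-times", Instant.class));
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            // nothing extra to do here; the list is written out as-is
        }
    }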
>>
>> On Sat, Mar 14, 2020 at 3:23 AM Jacob Sevart <js...@uber.com> wrote:
>>
>>> Oh, I should clarify that's 43MB per partition, so with 48 partitions it
>>> explains my 2GB.
>>>
>>> On Fri, Mar 13, 2020 at 7:21 PM Jacob Sevart <js...@uber.com> wrote:
>>>
>>>> Running *Checkpoints.loadCheckpointMetadata* under a debugger, I found
>>>> something:
>>>> *subtaskState.managedOperatorState[0].stateNameToPartitionOffsets("startup-times").offsets.value*
>>>> weighs 43MB (5.3 million longs).
>>>>
>>>> "startup-times" is an operator state of mine (a union list of
>>>> java.time.Instant). I see a way to end up with fewer items in the list,
>>>> but I'm not sure how the actual size is related to the number of
>>>> offsets. Can you elaborate on that?
>>>>
>>>> Incidentally, 42.5MB is the number I got out of
>>>> https://issues.apache.org/jira/browse/FLINK-14618.
>>>> So I think my two problems are closely related.
>>>>
>>>> Jacob
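
(A minimal sketch of that inspection outside a debugger, using Flink's
internal Checkpoints utility -- an internal API whose exact signature
varies by version; this is roughly the 1.6 shape:)

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.flink.runtime.checkpoint.Checkpoints;
    import org.apache.flink.runtime.checkpoint.savepoint.Savepoint;

    public class InspectMetadata {
        public static void main(String[] args) throws Exception {
            // args[0] = path to the _metadata file
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(Paths.get(args[0]))))) {
                Savepoint sp = Checkpoints.loadCheckpointMetadata(
                        in, InspectMetadata.class.getClassLoader());
                System.out.println("checkpoint id: " + sp.getCheckpointId());
                // walk the operator states to see which one dominates the file
                sp.getOperatorStates().forEach(os ->
                        System.out.println(os.getOperatorID() + " -> " + os));
            }
        }
    }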
>>>>
>>>> On Mon, Mar 9, 2020 at 6:36 AM Congxian Qiu <qc...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> As Gordon said, the metadata will contain ByteStreamStateHandles; when
>>>>> writing out a ByteStreamStateHandle, Flink also writes out the handle
>>>>> name -- which is a path (as you saw). A ByteStreamStateHandle is
>>>>> created when the state size is smaller than
>>>>> `state.backend.fs.memory-threshold` (default is 1024).
>>>>>
>>>>> If you want to verify this, you can refer to the unit test
>>>>> `CheckpointMetadataLoadingTest#testLoadAndValidateSavepoint` and load the
>>>>> metadata; you will find that there are many `ByteStreamStateHandle`s, and
>>>>> their names are the strings you saw in the metadata.
>>>>>
>>>>> Best,
>>>>> Congxian
>>>>>
>>>>>
>>>>> Jacob Sevart <js...@uber.com> wrote on Fri, Mar 6, 2020 at 3:57 AM:
>>>>>
>>>>>> Thanks, I will monitor that thread.
>>>>>>
>>>>>> I'm having a hard time following the serialization code, but if you
>>>>>> know anything about the layout, tell me if this makes sense. What I see in
>>>>>> the hex editor is, first, many HDFS paths. Then gigabytes of unreadable
>>>>>> data. Then finally another HDFS path at the end.
>>>>>>
>>>>>> If it is putting state in there, under normal circumstances, does it
>>>>>> make sense that it would be interleaved with metadata? I would expect all
>>>>>> the metadata to come first, and then state.
>>>>>>
>>>>>> Jacob
>>>>>>
>>>>>> On Thu, Mar 5, 2020 at 10:53 AM Kostas Kloudas <kk...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Jacob,
>>>>>>>
>>>>>>> As I said previously I am not 100% sure what can be causing this
>>>>>>> behavior, but this is a related thread here:
>>>>>>>
>>>>>>> https://lists.apache.org/thread.html/r3bfa2a3368a9c7850cba778e4decfe4f6dba9607f32addb69814f43d@%3Cuser.flink.apache.org%3E
>>>>>>>
>>>>>>> where you can re-post your problem and monitor for answers.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Kostas
>>>>>>>
>>>>>>> On Wed, Mar 4, 2020 at 7:02 PM Jacob Sevart <js...@uber.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Kostas and Gordon,
>>>>>>> >
>>>>>>> > Thanks for the suggestions! I'm on RocksDB. We don't have that
>>>>>>> setting configured so it should be at the default 1024b. This is the full
>>>>>>> "state.*" section showing in the JobManager UI.
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > Jacob
>>>>>>> >
>>>>>>> > On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <
>>>>>>> tzulitai@apache.org> wrote:
>>>>>>> >>
>>>>>>> >> Hi Jacob,
>>>>>>> >>
>>>>>>> >> Apart from what Klou already mentioned, one slightly possible
>>>>>>> reason:
>>>>>>> >>
>>>>>>> >> If you are using the FsStateBackend, it is also possible that
>>>>>>> >> your state is small enough to be stored inline within the
>>>>>>> >> metadata file.
>>>>>>> >> That is governed by the "state.backend.fs.memory-threshold"
>>>>>>> configuration, with a default value of 1024 bytes, or can also be
>>>>>>> configured with the `fileStateSizeThreshold` argument when constructing the
>>>>>>> `FsStateBackend`.
>>>>>>> >> The purpose of that threshold is to ensure that the backend does
>>>>>>> not create a large amount of very small files, where potentially the file
>>>>>>> pointers are actually larger than the state itself.
>>>>>>> >>
>>>>>>> >> Cheers,
>>>>>>> >> Gordon
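
(A minimal sketch of setting that threshold when constructing the backend;
the checkpoint URI here is illustrative:)

    import java.net.URI;

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ThresholdExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            // 0 forces all state into separate files rather than inline in
            // _metadata; the default threshold is 1024 bytes. The same knob
            // is state.backend.fs.memory-threshold in flink-conf.yaml.
            env.setStateBackend(new FsStateBackend(
                    new URI("hdfs:///flink/checkpoints"), 0));
        }
    }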
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Jacob,
>>>>>>> >>>
>>>>>>> >>> Could you specify which StateBackend you are using?
>>>>>>> >>>
>>>>>>> >>> The reason I am asking is that, from the documentation in [1]:
>>>>>>> >>>
>>>>>>> >>> "Note that if you use the MemoryStateBackend, metadata and
>>>>>>> savepoint
>>>>>>> >>> state will be stored in the _metadata file. Since it is
>>>>>>> >>> self-contained, you may move the file and restore from any
>>>>>>> location."
>>>>>>> >>>
>>>>>>> >>> I am also cc'ing Gordon who may know a bit more about state
>>>>>>> formats.
>>>>>>> >>>
>>>>>>> >>> I hope this helps,
>>>>>>> >>> Kostas
>>>>>>> >>>
>>>>>>> >>> [1]
>>>>>>> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>>>>>>> >>>

Re: Very large _metadata file

Posted by Jacob Sevart <js...@uber.com>.
Thanks! That would do it. I've disabled the operator for now.

The purpose was to know the age of the job's state, so that we could
consider its output in terms of how much context it knows. Regular state
seemed insufficient because partitions might see their first traffic at
different times.

How would you go about implementing something like that?

On Mon, Mar 16, 2020 at 1:54 PM Till Rohrmann <tr...@apache.org> wrote:

> Hi Jacob,
>
> I think you are running into some deficiencies of Flink's union state
> here. The problem is that for every entry in your list state, Flink stores
> a separate offset (a long value). The reason for this behaviour is that we
> use the same state implementation for the union state as well as for the
> split state. For the latter, the offset information is required to split
> the state in case of changing the parallelism of your job.
>
> My recommendation would be to try to get rid of union state all together.
> The union state has primarily been introduced to checkpoint some source
> implementations and might become deprecated due to performance problems
> once these sources can be checkpointed differently.
>
> Cheers,
> Till
>
> On Sat, Mar 14, 2020 at 3:23 AM Jacob Sevart <js...@uber.com> wrote:
>
>> Oh, I should clarify that's 43MB per partition, so with 48 partitions it
>> explains my 2GB.
>>
>> On Fri, Mar 13, 2020 at 7:21 PM Jacob Sevart <js...@uber.com> wrote:
>>
>>> Running *Checkpoints.loadCheckpointMetadata *under a debugger, I found
>>> something:
>>> *subtaskState.managedOperatorState[0].stateNameToPartitionOffsets("startup-times").offsets.value *weighs
>>> 43MB (5.3 million longs).
>>>
>>> "startup-times" is an operator state of mine (union list of
>>> java.time.Instant). I see a way to end up with fewer items in the list, but I'm
>>> not sure how the actual size is related to the number of offsets. Can you
>>> elaborate on that?
>>>
>>> Incidentally, 42.5MB is the number I got out of
>>> https://issues.apache.org/jira/browse/FLINK-14618.
>>> So I think my two problems are closely related.
>>>
>>> Jacob
>>>
>>> On Mon, Mar 9, 2020 at 6:36 AM Congxian Qiu <qc...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> As Gordon said, the metadata will contain the ByteStreamStateHandle;
>>>> when writing out the ByteStreamStateHandle, Flink will write out the
>>>> handle name -- which is a path (as you saw). The ByteStreamStateHandle
>>>> will be created when the state size is smaller than
>>>> `state.backend.fs.memory-threshold` (default is 1024).
>>>>
>>>> If you want to verify this, you can refer to the unit test
>>>> `CheckpointMetadataLoadingTest#testLoadAndValidateSavepoint` and load
>>>> the metadata; you will find that there are many
>>>> `ByteStreamStateHandle`s, and their names are the strings you saw in
>>>> the metadata.
>>>>
>>>> Best,
>>>> Congxian
>>>>
>>>>
>>>> Jacob Sevart <js...@uber.com> wrote on Fri, Mar 6, 2020 at 3:57 AM:
>>>>
>>>>> Thanks, I will monitor that thread.
>>>>>
>>>>> I'm having a hard time following the serialization code, but if you
>>>>> know anything about the layout, tell me if this makes sense. What I see in
>>>>> the hex editor is, first, many HDFS paths. Then gigabytes of unreadable
>>>>> data. Then finally another HDFS path at the end.
>>>>>
>>>>> If it is putting state in there, under normal circumstances, does it
>>>>> make sense that it would be interleaved with metadata? I would expect all
>>>>> the metadata to come first, and then state.
>>>>>
>>>>> Jacob
>>>>>
>>>>> On Thu, Mar 5, 2020 at 10:53 AM Kostas Kloudas <kk...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Jacob,
>>>>>>
>>>>>> As I said previously I am not 100% sure what can be causing this
>>>>>> behavior, but this is a related thread here:
>>>>>>
>>>>>> https://lists.apache.org/thread.html/r3bfa2a3368a9c7850cba778e4decfe4f6dba9607f32addb69814f43d%40%3Cuser.flink.apache.org%3E
>>>>>>
>>>>>> There you can re-post your problem and monitor for answers.
>>>>>>
>>>>>> Cheers,
>>>>>> Kostas
>>>>>>
>>>>>> On Wed, Mar 4, 2020 at 7:02 PM Jacob Sevart <js...@uber.com> wrote:
>>>>>> >
>>>>>> > Kostas and Gordon,
>>>>>> >
>>>>>> > Thanks for the suggestions! I'm on RocksDB. We don't have that
>>>>>> setting configured so it should be at the default 1024b. This is the full
>>>>>> "state.*" section showing in the JobManager UI.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Jacob
>>>>>> >
>>>>>> > On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <
>>>>>> tzulitai@apache.org> wrote:
>>>>>> >>
>>>>>> >> Hi Jacob,
>>>>>> >>
>>>>>> >> Apart from what Klou already mentioned, one slightly possible
>>>>>> reason:
>>>>>> >>
>>>>>> >> If you are using the FsStateBackend, it is also possible that your
>>>>>> state is small enough to be considered to be stored inline within the
>>>>>> metadata file.
>>>>>> >> That is governed by the "state.backend.fs.memory-threshold"
>>>>>> configuration, with a default value of 1024 bytes, or can also be
>>>>>> configured with the `fileStateSizeThreshold` argument when constructing the
>>>>>> `FsStateBackend`.
>>>>>> >> The purpose of that threshold is to ensure that the backend does
>>>>>> not create a large amount of very small files, where potentially the file
>>>>>> pointers are actually larger than the state itself.
>>>>>> >>
>>>>>> >> Cheers,
>>>>>> >> Gordon
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> Hi Jacob,
>>>>>> >>>
>>>>>> >>> Could you specify which StateBackend you are using?
>>>>>> >>>
>>>>>> >>> The reason I am asking is that, from the documentation in [1]:
>>>>>> >>>
>>>>>> >>> "Note that if you use the MemoryStateBackend, metadata and
>>>>>> savepoint
>>>>>> >>> state will be stored in the _metadata file. Since it is
>>>>>> >>> self-contained, you may move the file and restore from any
>>>>>> location."
>>>>>> >>>
>>>>>> >>> I am also cc'ing Gordon who may know a bit more about state
>>>>>> formats.
>>>>>> >>>
>>>>>> >>> I hope this helps,
>>>>>> >>> Kostas
>>>>>> >>>
>>>>>> >>> [1]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>>>>>> >>>
>>>>>> >>> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com>
>>>>>> wrote:
>>>>>> >>> >
>>>>>> >>> > Per the documentation:
>>>>>> >>> >
>>>>>> >>> > "The meta data file of a Savepoint contains (primarily)
>>>>>> pointers to all files on stable storage that are part of the Savepoint, in
>>>>>> form of absolute paths."
>>>>>> >>> >
>>>>>> >>> > I somehow have a _metadata file that's 1.9GB. Running strings
>>>>>> on it I find 962 strings, most of which look like HDFS paths, which leaves
>>>>>> a lot of that file-size unexplained. What else is in there, and how exactly
>>>>>> could this be happening?
>>>>>> >>> >
>>>>>> >>> > We're running 1.6.
>>>>>> >>> >
>>>>>> >>> > Jacob
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Jacob Sevart
>>>>>> > Software Engineer, Safety
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jacob Sevart
>>>>> Software Engineer, Safety
>>>>>
>>>>
>>>
>>> --
>>> Jacob Sevart
>>> Software Engineer, Safety
>>>
>>
>>
>> --
>> Jacob Sevart
>> Software Engineer, Safety
>>
>

-- 
Jacob Sevart
Software Engineer, Safety

Re: Very large _metadata file

Posted by Till Rohrmann <tr...@apache.org>.
Hi Jacob,

I think you are running into some deficiencies of Flink's union state here.
The problem is that for every entry in your list state, Flink stores a
separate offset (a long value). The reason for this behaviour is that we
use the same state implementation for the union state as well as for the
split state. For the latter, the offset information is required to split
the state in case of changing the parallelism of your job.
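
A minimal sketch of the kind of declaration that triggers this, modeled on
the "startup-times" union list state described elsewhere in this thread
(the class and the surrounding ProcessFunction are illustrative; the point
is getUnionListState, which is the actual Flink API involved):

import java.time.Instant;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class StartupTimeTracker extends ProcessFunction<String, String>
        implements CheckpointedFunction {

    private transient ListState<Instant> startupTimes;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // Union list state: on restore, every subtask receives the union of
        // all subtasks' entries, and the checkpoint metadata records one
        // long offset per element. Millions of entries therefore mean tens
        // of megabytes of offsets per subtask in _metadata.
        startupTimes = context.getOperatorStateStore().getUnionListState(
                new ListStateDescriptor<>("startup-times", Instant.class));
        if (!context.isRestored()) {
            startupTimes.add(Instant.now()); // only on a fresh start
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // entries are only added on a fresh start; nothing to do here
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        out.collect(value);
    }
}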

My recommendation would be to try to get rid of union state altogether.
The union state has primarily been introduced to checkpoint some source
implementations and might become deprecated due to performance problems
once these sources can be checkpointed differently.

Cheers,
Till

On Sat, Mar 14, 2020 at 3:23 AM Jacob Sevart <js...@uber.com> wrote:

> Oh, I should clarify that's 43MB per partition, so with 48 partitions it
> explains my 2GB.
>
> On Fri, Mar 13, 2020 at 7:21 PM Jacob Sevart <js...@uber.com> wrote:
>
>> Running *Checkpoints.loadCheckpointMetadata *under a debugger, I found
>> something:
>> *subtaskState.managedOperatorState[0].stateNameToPartitionOffsets("startup-times").offsets.value *weighs
>> 43MB (5.3 million longs).
>>
>> "startup-times" is an operator state of mine (union list of
>> java.time.Instant). I see a way to end up with fewer items in the list, but I'm
>> not sure how the actual size is related to the number of offsets. Can you
>> elaborate on that?
>>
>> Incidentally, 42.5MB is the number I got out of
>> https://issues.apache.org/jira/browse/FLINK-14618. So I think my two
>> problems are closely related.
>>
>> Jacob
>>
>> On Mon, Mar 9, 2020 at 6:36 AM Congxian Qiu <qc...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> As Gordon said, the metadata will contain the ByteStreamStateHandle;
>>> when writing out the ByteStreamStateHandle, Flink will write out the
>>> handle name -- which is a path (as you saw). The ByteStreamStateHandle
>>> will be created when the state size is smaller than
>>> `state.backend.fs.memory-threshold` (default is 1024).
>>>
>>> If you want to verify this, you can refer to the unit test
>>> `CheckpointMetadataLoadingTest#testLoadAndValidateSavepoint` and load
>>> the metadata; you will find that there are many
>>> `ByteStreamStateHandle`s, and their names are the strings you saw in
>>> the metadata.
>>>
>>> Best,
>>> Congxian
>>>
>>>
>>> Jacob Sevart <js...@uber.com> wrote on Fri, Mar 6, 2020 at 3:57 AM:
>>>
>>>> Thanks, I will monitor that thread.
>>>>
>>>> I'm having a hard time following the serialization code, but if you
>>>> know anything about the layout, tell me if this makes sense. What I see in
>>>> the hex editor is, first, many HDFS paths. Then gigabytes of unreadable
>>>> data. Then finally another HDFS path at the end.
>>>>
>>>> If it is putting state in there, under normal circumstances, does it
>>>> make sense that it would be interleaved with metadata? I would expect all
>>>> the metadata to come first, and then state.
>>>>
>>>> Jacob
>>>>
>>>> On Thu, Mar 5, 2020 at 10:53 AM Kostas Kloudas <kk...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jacob,
>>>>>
>>>>> As I said previously I am not 100% sure what can be causing this
>>>>> behavior, but this is a related thread here:
>>>>>
>>>>> https://lists.apache.org/thread.html/r3bfa2a3368a9c7850cba778e4decfe4f6dba9607f32addb69814f43d%40%3Cuser.flink.apache.org%3E
>>>>>
>>>>> There you can re-post your problem and monitor for answers.
>>>>>
>>>>> Cheers,
>>>>> Kostas
>>>>>
>>>>> On Wed, Mar 4, 2020 at 7:02 PM Jacob Sevart <js...@uber.com> wrote:
>>>>> >
>>>>> > Kostas and Gordon,
>>>>> >
>>>>> > Thanks for the suggestions! I'm on RocksDB. We don't have that
>>>>> setting configured so it should be at the default 1024b. This is the full
>>>>> "state.*" section showing in the JobManager UI.
>>>>> >
>>>>> >
>>>>> >
>>>>> > Jacob
>>>>> >
>>>>> > On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <
>>>>> tzulitai@apache.org> wrote:
>>>>> >>
>>>>> >> Hi Jacob,
>>>>> >>
>>>>> >> Apart from what Klou already mentioned, one slightly possible
>>>>> reason:
>>>>> >>
>>>>> >> If you are using the FsStateBackend, it is also possible that your
>>>>> state is small enough to be considered to be stored inline within the
>>>>> metadata file.
>>>>> >> That is governed by the "state.backend.fs.memory-threshold"
>>>>> configuration, with a default value of 1024 bytes, or can also be
>>>>> configured with the `fileStateSizeThreshold` argument when constructing the
>>>>> `FsStateBackend`.
>>>>> >> The purpose of that threshold is to ensure that the backend does
>>>>> not create a large amount of very small files, where potentially the file
>>>>> pointers are actually larger than the state itself.
>>>>> >>
>>>>> >> Cheers,
>>>>> >> Gordon
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> Hi Jacob,
>>>>> >>>
>>>>> >>> Could you specify which StateBackend you are using?
>>>>> >>>
>>>>> >>> The reason I am asking is that, from the documentation in [1]:
>>>>> >>>
>>>>> >>> "Note that if you use the MemoryStateBackend, metadata and
>>>>> savepoint
>>>>> >>> state will be stored in the _metadata file. Since it is
>>>>> >>> self-contained, you may move the file and restore from any
>>>>> location."
>>>>> >>>
>>>>> >>> I am also cc'ing Gordon who may know a bit more about state
>>>>> formats.
>>>>> >>>
>>>>> >>> I hope this helps,
>>>>> >>> Kostas
>>>>> >>>
>>>>> >>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>>>>> >>>
>>>>> >>> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com>
>>>>> wrote:
>>>>> >>> >
>>>>> >>> > Per the documentation:
>>>>> >>> >
>>>>> >>> > "The meta data file of a Savepoint contains (primarily) pointers
>>>>> to all files on stable storage that are part of the Savepoint, in form of
>>>>> absolute paths."
>>>>> >>> >
>>>>> >>> > I somehow have a _metadata file that's 1.9GB. Running strings on
>>>>> it I find 962 strings, most of which look like HDFS paths, which leaves a
>>>>> lot of that file-size unexplained. What else is in there, and how exactly
>>>>> could this be happening?
>>>>> >>> >
>>>>> >>> > We're running 1.6.
>>>>> >>> >
>>>>> >>> > Jacob
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Jacob Sevart
>>>>> > Software Engineer, Safety
>>>>>
>>>>
>>>>
>>>> --
>>>> Jacob Sevart
>>>> Software Engineer, Safety
>>>>
>>>
>>
>> --
>> Jacob Sevart
>> Software Engineer, Safety
>>
>
>
> --
> Jacob Sevart
> Software Engineer, Safety
>

Re: Very large _metadata file

Posted by Jacob Sevart <js...@uber.com>.
Oh, I should clarify that's 43MB per partition, so with 48 partitions it
explains my 2GB.

On Fri, Mar 13, 2020 at 7:21 PM Jacob Sevart <js...@uber.com> wrote:

> Running *Checkpoints.loadCheckpointMetadata *under a debugger, I found
> something:
> *subtaskState.managedOperatorState[0].stateNameToPartitionOffsets("startup-times").offsets.value *weighs
> 43MB (5.3 million longs).
>
> "startup-times" is an operator state of mine (union list of
> java.time.Instant). I see a way to end up with fewer items in the list, but I'm
> not sure how the actual size is related to the number of offsets. Can you
> elaborate on that?
>
> Incidentally, 42.5MB is the number I got out of
> https://issues.apache.org/jira/browse/FLINK-14618. So I think my two
> problems are closely related.
>
> Jacob
>
> On Mon, Mar 9, 2020 at 6:36 AM Congxian Qiu <qc...@gmail.com>
> wrote:
>
>> Hi
>>
>> As Gordon said, the metadata will contain the ByteStreamStateHandle; when
>> writing out the ByteStreamStateHandle, Flink will write out the handle
>> name -- which is a path (as you saw). The ByteStreamStateHandle will be
>> created when the state size is smaller than
>> `state.backend.fs.memory-threshold` (default is 1024).
>>
>> If you want to verify this, you can refer to the unit test
>> `CheckpointMetadataLoadingTest#testLoadAndValidateSavepoint` and load the
>> metadata; you will find that there are many `ByteStreamStateHandle`s, and
>> their names are the strings you saw in the metadata.
>>
>> Best,
>> Congxian
>>
>>
>> Jacob Sevart <js...@uber.com> wrote on Fri, Mar 6, 2020 at 3:57 AM:
>>
>>> Thanks, I will monitor that thread.
>>>
>>> I'm having a hard time following the serialization code, but if you know
>>> anything about the layout, tell me if this makes sense. What I see in the
>>> hex editor is, first, many HDFS paths. Then gigabytes of unreadable data.
>>> Then finally another HDFS path at the end.
>>>
>>> If it is putting state in there, under normal circumstances, does it
>>> make sense that it would be interleaved with metadata? I would expect all
>>> the metadata to come first, and then state.
>>>
>>> Jacob
>>>
>>> On Thu, Mar 5, 2020 at 10:53 AM Kostas Kloudas <kk...@gmail.com>
>>> wrote:
>>>
>>>> Hi Jacob,
>>>>
>>>> As I said previously I am not 100% sure what can be causing this
>>>> behavior, but this is a related thread here:
>>>>
>>>> https://lists.apache.org/thread.html/r3bfa2a3368a9c7850cba778e4decfe4f6dba9607f32addb69814f43d%40%3Cuser.flink.apache.org%3E
>>>>
>>>> There you can re-post your problem and monitor for answers.
>>>>
>>>> Cheers,
>>>> Kostas
>>>>
>>>> On Wed, Mar 4, 2020 at 7:02 PM Jacob Sevart <js...@uber.com> wrote:
>>>> >
>>>> > Kostas and Gordon,
>>>> >
>>>> > Thanks for the suggestions! I'm on RocksDB. We don't have that
>>>> setting configured so it should be at the default 1024b. This is the full
>>>> "state.*" section showing in the JobManager UI.
>>>> >
>>>> >
>>>> >
>>>> > Jacob
>>>> >
>>>> > On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <
>>>> tzulitai@apache.org> wrote:
>>>> >>
>>>> >> Hi Jacob,
>>>> >>
>>>> >> Apart from what Klou already mentioned, one slightly possible reason:
>>>> >>
>>>> >> If you are using the FsStateBackend, it is also possible that your
>>>> state is small enough to be considered to be stored inline within the
>>>> metadata file.
>>>> >> That is governed by the "state.backend.fs.memory-threshold"
>>>> configuration, with a default value of 1024 bytes, or can also be
>>>> configured with the `fileStateSizeThreshold` argument when constructing the
>>>> `FsStateBackend`.
>>>> >> The purpose of that threshold is to ensure that the backend does not
>>>> create a large amount of very small files, where potentially the file
>>>> pointers are actually larger than the state itself.
>>>> >>
>>>> >> Cheers,
>>>> >> Gordon
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com>
>>>> wrote:
>>>> >>>
>>>> >>> Hi Jacob,
>>>> >>>
>>>> >>> Could you specify which StateBackend you are using?
>>>> >>>
>>>> >>> The reason I am asking is that, from the documentation in [1]:
>>>> >>>
>>>> >>> "Note that if you use the MemoryStateBackend, metadata and savepoint
>>>> >>> state will be stored in the _metadata file. Since it is
>>>> >>> self-contained, you may move the file and restore from any
>>>> location."
>>>> >>>
>>>> >>> I am also cc'ing Gordon who may know a bit more about state formats.
>>>> >>>
>>>> >>> I hope this helps,
>>>> >>> Kostas
>>>> >>>
>>>> >>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>>>> >>>
>>>> >>> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com>
>>>> wrote:
>>>> >>> >
>>>> >>> > Per the documentation:
>>>> >>> >
>>>> >>> > "The meta data file of a Savepoint contains (primarily) pointers
>>>> to all files on stable storage that are part of the Savepoint, in form of
>>>> absolute paths."
>>>> >>> >
>>>> >>> > I somehow have a _metadata file that's 1.9GB. Running strings on
>>>> it I find 962 strings, most of which look like HDFS paths, which leaves a
>>>> lot of that file-size unexplained. What else is in there, and how exactly
>>>> could this be happening?
>>>> >>> >
>>>> >>> > We're running 1.6.
>>>> >>> >
>>>> >>> > Jacob
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Jacob Sevart
>>>> > Software Engineer, Safety
>>>>
>>>
>>>
>>> --
>>> Jacob Sevart
>>> Software Engineer, Safety
>>>
>>
>
> --
> Jacob Sevart
> Software Engineer, Safety
>


-- 
Jacob Sevart
Software Engineer, Safety

Re: Very large _metadata file

Posted by Jacob Sevart <js...@uber.com>.
Running *Checkpoints.loadCheckpointMetadata *under a debugger, I found
something:
*subtaskState.managedOperatorState[0].stateNameToPartitionOffsets("startup-times").offsets.value
*weighs
43MB (5.3 million longs).
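
(Sanity check on the numbers: 5.3 million longs at 8 bytes each is about
42.4MB, which matches; at ~43MB for each of 48 partitions, that accounts
for roughly 2GB of metadata.)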

"startup-times" is an operator state of mine (union list of
java.time.Instant). I see a way to end up with fewer items in the list, but I'm
not sure how the actual size is related to the number of offsets. Can you
elaborate on that?

Incidentally, 42.5MB is the number I got out of
https://issues.apache.org/jira/browse/FLINK-14618. So I think my two
problems are closely related.

Jacob

On Mon, Mar 9, 2020 at 6:36 AM Congxian Qiu <qc...@gmail.com> wrote:

> Hi
>
> As Gordon said, the metadata will contain the ByteStreamStateHandle; when
> writing out the ByteStreamStateHandle, Flink will write out the handle
> name -- which is a path (as you saw). The ByteStreamStateHandle will be
> created when the state size is smaller than
> `state.backend.fs.memory-threshold` (default is 1024).
>
> If you want to verify this, you can refer to the unit test
> `CheckpointMetadataLoadingTest#testLoadAndValidateSavepoint` and load the
> metadata; you will find that there are many `ByteStreamStateHandle`s, and
> their names are the strings you saw in the metadata.
>
> Best,
> Congxian
>
>
> Jacob Sevart <js...@uber.com> wrote on Fri, Mar 6, 2020 at 3:57 AM:
>
>> Thanks, I will monitor that thread.
>>
>> I'm having a hard time following the serialization code, but if you know
>> anything about the layout, tell me if this makes sense. What I see in the
>> hex editor is, first, many HDFS paths. Then gigabytes of unreadable data.
>> Then finally another HDFS path at the end.
>>
>> If it is putting state in there, under normal circumstances, does it make
>> sense that it would be interleaved with metadata? I would expect all the
>> metadata to come first, and then state.
>>
>> Jacob
>>
>> On Thu, Mar 5, 2020 at 10:53 AM Kostas Kloudas <kk...@gmail.com>
>> wrote:
>>
>>> Hi Jacob,
>>>
>>> As I said previously I am not 100% sure what can be causing this
>>> behavior, but this is a related thread here:
>>>
>>> https://lists.apache.org/thread.html/r3bfa2a3368a9c7850cba778e4decfe4f6dba9607f32addb69814f43d%40%3Cuser.flink.apache.org%3E
>>>
>>> There you can re-post your problem and monitor for answers.
>>>
>>> Cheers,
>>> Kostas
>>>
>>> On Wed, Mar 4, 2020 at 7:02 PM Jacob Sevart <js...@uber.com> wrote:
>>> >
>>> > Kostas and Gordon,
>>> >
>>> > Thanks for the suggestions! I'm on RocksDB. We don't have that setting
>>> configured so it should be at the default 1024b. This is the full "state.*"
>>> section showing in the JobManager UI.
>>> >
>>> >
>>> >
>>> > Jacob
>>> >
>>> > On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <
>>> tzulitai@apache.org> wrote:
>>> >>
>>> >> Hi Jacob,
>>> >>
>>> >> Apart from what Klou already mentioned, one slightly possible reason:
>>> >>
>>> >> If you are using the FsStateBackend, it is also possible that your
>>> state is small enough to be considered to be stored inline within the
>>> metadata file.
>>> >> That is governed by the "state.backend.fs.memory-threshold"
>>> configuration, with a default value of 1024 bytes, or can also be
>>> configured with the `fileStateSizeThreshold` argument when constructing the
>>> `FsStateBackend`.
>>> >> The purpose of that threshold is to ensure that the backend does not
>>> create a large amount of very small files, where potentially the file
>>> pointers are actually larger than the state itself.
>>> >>
>>> >> Cheers,
>>> >> Gordon
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> Hi Jacob,
>>> >>>
>>> >>> Could you specify which StateBackend you are using?
>>> >>>
>>> >>> The reason I am asking is that, from the documentation in [1]:
>>> >>>
>>> >>> "Note that if you use the MemoryStateBackend, metadata and savepoint
>>> >>> state will be stored in the _metadata file. Since it is
>>> >>> self-contained, you may move the file and restore from any location."
>>> >>>
>>> >>> I am also cc'ing Gordon who may know a bit more about state formats.
>>> >>>
>>> >>> I hope this helps,
>>> >>> Kostas
>>> >>>
>>> >>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>>> >>>
>>> >>> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com>
>>> wrote:
>>> >>> >
>>> >>> > Per the documentation:
>>> >>> >
>>> >>> > "The meta data file of a Savepoint contains (primarily) pointers
>>> to all files on stable storage that are part of the Savepoint, in form of
>>> absolute paths."
>>> >>> >
>>> >>> > I somehow have a _metadata file that's 1.9GB. Running strings on
>>> it I find 962 strings, most of which look like HDFS paths, which leaves a
>>> lot of that file-size unexplained. What else is in there, and how exactly
>>> could this be happening?
>>> >>> >
>>> >>> > We're running 1.6.
>>> >>> >
>>> >>> > Jacob
>>> >
>>> >
>>> >
>>> > --
>>> > Jacob Sevart
>>> > Software Engineer, Safety
>>>
>>
>>
>> --
>> Jacob Sevart
>> Software Engineer, Safety
>>
>

-- 
Jacob Sevart
Software Engineer, Safety

Re: Very large _metadata file

Posted by Congxian Qiu <qc...@gmail.com>.
Hi

As Gordon said, the metadata will contain the ByteStreamStateHandle; when
writing out the ByteStreamStateHandle, Flink will write out the handle name
-- which is a path (as you saw). The ByteStreamStateHandle will be created
when the state size is smaller than `state.backend.fs.memory-threshold`
(default is 1024).

If you want to verify this, you can refer to the unit test
`CheckpointMetadataLoadingTest#testLoadAndValidateSavepoint` and load the
metadata; you will find that there are many `ByteStreamStateHandle`s, and
their names are the strings you saw in the metadata.
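
For anyone who wants to check this outside a debugger or unit test, here is
a rough sketch of loading a _metadata file and printing the names of the
inlined handles. It leans on the Flink 1.6 internals mentioned in this
thread (`Checkpoints.loadCheckpointMetadata`, `ByteStreamStateHandle`); the
exact traversal (`getOperatorStates`, `getStates`,
`getManagedOperatorState`, `getHandleName`) is internal API and may differ
between versions:

import java.io.DataInputStream;
import java.io.FileInputStream;
import org.apache.flink.runtime.checkpoint.Checkpoints;
import org.apache.flink.runtime.checkpoint.OperatorState;
import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
import org.apache.flink.runtime.checkpoint.savepoint.Savepoint;
import org.apache.flink.runtime.state.OperatorStateHandle;
import org.apache.flink.runtime.state.memory.ByteStreamStateHandle;

public class InspectMetadata {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            Savepoint savepoint = Checkpoints.loadCheckpointMetadata(
                    in, InspectMetadata.class.getClassLoader());
            for (OperatorState operator : savepoint.getOperatorStates()) {
                for (OperatorSubtaskState subtask : operator.getStates()) {
                    for (OperatorStateHandle handle : subtask.getManagedOperatorState()) {
                        if (handle.getDelegateStateHandle() instanceof ByteStreamStateHandle) {
                            // Inlined state: the handle name is the path-like
                            // string you see with `strings` on _metadata.
                            System.out.println(((ByteStreamStateHandle)
                                    handle.getDelegateStateHandle()).getHandleName());
                        }
                    }
                }
            }
        }
    }
}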

Best,
Congxian


Jacob Sevart <js...@uber.com> wrote on Fri, Mar 6, 2020 at 3:57 AM:

> Thanks, I will monitor that thread.
>
> I'm having a hard time following the serialization code, but if you know
> anything about the layout, tell me if this makes sense. What I see in the
> hex editor is, first, many HDFS paths. Then gigabytes of unreadable data.
> Then finally another HDFS path at the end.
>
> If it is putting state in there, under normal circumstances, does it make
> sense that it would be interleaved with metadata? I would expect all the
> metadata to come first, and then state.
>
> Jacob
>
> On Thu, Mar 5, 2020 at 10:53 AM Kostas Kloudas <kk...@gmail.com> wrote:
>
>> Hi Jacob,
>>
>> As I said previously I am not 100% sure what can be causing this
>> behavior, but this is a related thread here:
>>
>> https://lists.apache.org/thread.html/r3bfa2a3368a9c7850cba778e4decfe4f6dba9607f32addb69814f43d%40%3Cuser.flink.apache.org%3E
>>
>> There you can re-post your problem and monitor for answers.
>>
>> Cheers,
>> Kostas
>>
>> On Wed, Mar 4, 2020 at 7:02 PM Jacob Sevart <js...@uber.com> wrote:
>> >
>> > Kostas and Gordon,
>> >
>> > Thanks for the suggestions! I'm on RocksDB. We don't have that setting
>> configured so it should be at the default 1024b. This is the full "state.*"
>> section showing in the JobManager UI.
>> >
>> >
>> >
>> > Jacob
>> >
>> > On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <tz...@apache.org>
>> wrote:
>> >>
>> >> Hi Jacob,
>> >>
>> >> Apart from what Klou already mentioned, one slightly possible reason:
>> >>
>> >> If you are using the FsStateBackend, it is also possible that your
>> state is small enough to be considered to be stored inline within the
>> metadata file.
>> >> That is governed by the "state.backend.fs.memory-threshold"
>> configuration, with a default value of 1024 bytes, or can also be
>> configured with the `fileStateSizeThreshold` argument when constructing the
>> `FsStateBackend`.
>> >> The purpose of that threshold is to ensure that the backend does not
>> create a large amount of very small files, where potentially the file
>> pointers are actually larger than the state itself.
>> >>
>> >> Cheers,
>> >> Gordon
>> >>
>> >>
>> >>
>> >> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com>
>> wrote:
>> >>>
>> >>> Hi Jacob,
>> >>>
>> >>> Could you specify which StateBackend you are using?
>> >>>
>> >>> The reason I am asking is that, from the documentation in [1]:
>> >>>
>> >>> "Note that if you use the MemoryStateBackend, metadata and savepoint
>> >>> state will be stored in the _metadata file. Since it is
>> >>> self-contained, you may move the file and restore from any location."
>> >>>
>> >>> I am also cc'ing Gordon who may know a bit more about state formats.
>> >>>
>> >>> I hope this helps,
>> >>> Kostas
>> >>>
>> >>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>> >>>
>> >>> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com> wrote:
>> >>> >
>> >>> > Per the documentation:
>> >>> >
>> >>> > "The meta data file of a Savepoint contains (primarily) pointers to
>> all files on stable storage that are part of the Savepoint, in form of
>> absolute paths."
>> >>> >
>> >>> > I somehow have a _metadata file that's 1.9GB. Running strings on it
>> I find 962 strings, most of which look like HDFS paths, which leaves a lot
>> of that file-size unexplained. What else is in there, and how exactly could
>> this be happening?
>> >>> >
>> >>> > We're running 1.6.
>> >>> >
>> >>> > Jacob
>> >
>> >
>> >
>> > --
>> > Jacob Sevart
>> > Software Engineer, Safety
>>
>
>
> --
> Jacob Sevart
> Software Engineer, Safety
>

Re: Very large _metadata file

Posted by Jacob Sevart <js...@uber.com>.
Thanks, I will monitor that thread.

I'm having a hard time following the serialization code, but if you know
anything about the layout, tell me if this makes sense. What I see in the
hex editor is, first, many HDFS paths. Then gigabytes of unreadable data.
Then finally another HDFS path at the end.

If it is putting state in there, under normal circumstances, does it make
sense that it would be interleaved with metadata? I would expect all the
metadata to come first, and then state.

Jacob

On Thu, Mar 5, 2020 at 10:53 AM Kostas Kloudas <kk...@gmail.com> wrote:

> Hi Jacob,
>
> As I said previously I am not 100% sure what can be causing this
> behavior, but this is a related thread here:
>
> https://lists.apache.org/thread.html/r3bfa2a3368a9c7850cba778e4decfe4f6dba9607f32addb69814f43d%40%3Cuser.flink.apache.org%3E
>
> There you can re-post your problem and monitor for answers.
>
> Cheers,
> Kostas
>
> On Wed, Mar 4, 2020 at 7:02 PM Jacob Sevart <js...@uber.com> wrote:
> >
> > Kostas and Gordon,
> >
> > Thanks for the suggestions! I'm on RocksDB. We don't have that setting
> configured so it should be at the default 1024b. This is the full "state.*"
> section showing in the JobManager UI.
> >
> >
> >
> > Jacob
> >
> > On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <tz...@apache.org>
> wrote:
> >>
> >> Hi Jacob,
> >>
> >> Apart from what Klou already mentioned, one slightly possible reason:
> >>
> >> If you are using the FsStateBackend, it is also possible that your
> state is small enough to be considered to be stored inline within the
> metadata file.
> >> That is governed by the "state.backend.fs.memory-threshold"
> configuration, with a default value of 1024 bytes, or can also be
> configured with the `fileStateSizeThreshold` argument when constructing the
> `FsStateBackend`.
> >> The purpose of that threshold is to ensure that the backend does not
> create a large amount of very small files, where potentially the file
> pointers are actually larger than the state itself.
> >>
> >> Cheers,
> >> Gordon
> >>
> >>
> >>
> >> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com>
> wrote:
> >>>
> >>> Hi Jacob,
> >>>
> >>> Could you specify which StateBackend you are using?
> >>>
> >>> The reason I am asking is that, from the documentation in [1]:
> >>>
> >>> "Note that if you use the MemoryStateBackend, metadata and savepoint
> >>> state will be stored in the _metadata file. Since it is
> >>> self-contained, you may move the file and restore from any location."
> >>>
> >>> I am also cc'ing Gordon who may know a bit more about state formats.
> >>>
> >>> I hope this helps,
> >>> Kostas
> >>>
> >>> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
> >>>
> >>> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com> wrote:
> >>> >
> >>> > Per the documentation:
> >>> >
> >>> > "The meta data file of a Savepoint contains (primarily) pointers to
> all files on stable storage that are part of the Savepoint, in form of
> absolute paths."
> >>> >
> >>> > I somehow have a _metadata file that's 1.9GB. Running strings on it
> I find 962 strings, most of which look like HDFS paths, which leaves a lot
> of that file-size unexplained. What else is in there, and how exactly could
> this be happening?
> >>> >
> >>> > We're running 1.6.
> >>> >
> >>> > Jacob
> >
> >
> >
> > --
> > Jacob Sevart
> > Software Engineer, Safety
>


-- 
Jacob Sevart
Software Engineer, Safety

Re: Very large _metadata file

Posted by Kostas Kloudas <kk...@gmail.com>.
Hi Jacob,

As I said previously I am not 100% sure what can be causing this
behavior, but this is a related thread here:
https://lists.apache.org/thread.html/r3bfa2a3368a9c7850cba778e4decfe4f6dba9607f32addb69814f43d%40%3Cuser.flink.apache.org%3E

There you can re-post your problem and monitor for answers.

Cheers,
Kostas

On Wed, Mar 4, 2020 at 7:02 PM Jacob Sevart <js...@uber.com> wrote:
>
> Kostas and Gordon,
>
> Thanks for the suggestions! I'm on RocksDB. We don't have that setting configured so it should be at the default 1024b. This is the full "state.*" section showing in the JobManager UI.
>
>
>
> Jacob
>
> On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <tz...@apache.org> wrote:
>>
>> Hi Jacob,
>>
>> Apart from what Klou already mentioned, one slightly possible reason:
>>
>> If you are using the FsStateBackend, it is also possible that your state is small enough to be considered to be stored inline within the metadata file.
>> That is governed by the "state.backend.fs.memory-threshold" configuration, with a default value of 1024 bytes, or can also be configured with the `fileStateSizeThreshold` argument when constructing the `FsStateBackend`.
>> The purpose of that threshold is to ensure that the backend does not create a large amount of very small files, where potentially the file pointers are actually larger than the state itself.
>>
>> Cheers,
>> Gordon
>>
>>
>>
>> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com> wrote:
>>>
>>> Hi Jacob,
>>>
>>> Could you specify which StateBackend you are using?
>>>
>>> The reason I am asking is that, from the documentation in [1]:
>>>
>>> "Note that if you use the MemoryStateBackend, metadata and savepoint
>>> state will be stored in the _metadata file. Since it is
>>> self-contained, you may move the file and restore from any location."
>>>
>>> I am also cc'ing Gordon who may know a bit more about state formats.
>>>
>>> I hope this helps,
>>> Kostas
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>>>
>>> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com> wrote:
>>> >
>>> > Per the documentation:
>>> >
>>> > "The meta data file of a Savepoint contains (primarily) pointers to all files on stable storage that are part of the Savepoint, in form of absolute paths."
>>> >
>>> > I somehow have a _metadata file that's 1.9GB. Running strings on it I find 962 strings, most of which look like HDFS paths, which leaves a lot of that file-size unexplained. What else is in there, and how exactly could this be happening?
>>> >
>>> > We're running 1.6.
>>> >
>>> > Jacob
>
>
>
> --
> Jacob Sevart
> Software Engineer, Safety

Re: Very large _metadata file

Posted by Jacob Sevart <js...@uber.com>.
Kostas and Gordon,

Thanks for the suggestions! I'm on RocksDB. We don't have that setting
configured, so it should be at the default of 1024 bytes. This is the full
"state.*" section shown in the JobManager UI.

[image: Screen Shot 2020-03-04 at 9.56.20 AM.png]

Jacob

On Wed, Mar 4, 2020 at 2:45 AM Tzu-Li (Gordon) Tai <tz...@apache.org>
wrote:

> Hi Jacob,
>
> Apart from what Klou already mentioned, one slightly possible reason:
>
> If you are using the FsStateBackend, it is also possible that your state
> is small enough to be considered to be stored inline within the metadata
> file.
> That is governed by the "state.backend.fs.memory-threshold" configuration,
> with a default value of 1024 bytes, or can also be configured with the
> `fileStateSizeThreshold` argument when constructing the `FsStateBackend`.
> The purpose of that threshold is to ensure that the backend does not
> create a large amount of very small files, where potentially the file
> pointers are actually larger than the state itself.
>
> Cheers,
> Gordon
>
>
>
> On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com> wrote:
>
>> Hi Jacob,
>>
>> Could you specify which StateBackend you are using?
>>
>> The reason I am asking is that, from the documentation in [1]:
>>
>> "Note that if you use the MemoryStateBackend, metadata and savepoint
>> state will be stored in the _metadata file. Since it is
>> self-contained, you may move the file and restore from any location."
>>
>> I am also cc'ing Gordon who may know a bit more about state formats.
>>
>> I hope this helps,
>> Kostas
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>>
>> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com> wrote:
>> >
>> > Per the documentation:
>> >
>> > "The meta data file of a Savepoint contains (primarily) pointers to all
>> files on stable storage that are part of the Savepoint, in form of absolute
>> paths."
>> >
>> > I somehow have a _metadata file that's 1.9GB. Running strings on it I
>> find 962 strings, most of which look like HDFS paths, which leaves a lot of
>> that file-size unexplained. What else is in there, and how exactly could
>> this be happening?
>> >
>> > We're running 1.6.
>> >
>> > Jacob
>>
>

-- 
Jacob Sevart
Software Engineer, Safety

Re: Very large _metadata file

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.
Hi Jacob,

Apart from what Klou already mentioned, here is one other possible reason:

If you are using the FsStateBackend, it is also possible that your state is
small enough to be stored inline within the metadata file. That is governed
by the "state.backend.fs.memory-threshold" configuration, with a default
value of 1024 bytes; it can also be configured with the
`fileStateSizeThreshold` argument when constructing the `FsStateBackend`.
The purpose of that threshold is to ensure that the backend does not create
a large number of very small files, where the file pointers could end up
larger than the state itself.
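
To make those two knobs concrete, here is a minimal sketch; the checkpoint
URI and class name are placeholders, and it assumes the 1.6-era
`FsStateBackend(URI, int)` constructor:

import java.net.URI;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ThresholdExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // A threshold of 0 forces every piece of state into its own file on
        // stable storage, so _metadata holds only pointers to those files.
        env.setStateBackend(new FsStateBackend(new URI("hdfs:///flink/checkpoints"), 0));
        // The cluster-wide equivalent in flink-conf.yaml:
        //   state.backend.fs.memory-threshold: 0
    }
}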

Cheers,
Gordon



On Wed, Mar 4, 2020 at 6:17 PM Kostas Kloudas <kk...@gmail.com> wrote:

> Hi Jacob,
>
> Could you specify which StateBackend you are using?
>
> The reason I am asking is that, from the documentation in [1]:
>
> "Note that if you use the MemoryStateBackend, metadata and savepoint
> state will be stored in the _metadata file. Since it is
> self-contained, you may move the file and restore from any location."
>
> I am also cc'ing Gordon who may know a bit more about state formats.
>
> I hope this helps,
> Kostas
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html
>
> On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com> wrote:
> >
> > Per the documentation:
> >
> > "The meta data file of a Savepoint contains (primarily) pointers to all
> files on stable storage that are part of the Savepoint, in form of absolute
> paths."
> >
> > I somehow have a _metadata file that's 1.9GB. Running strings on it I
> find 962 strings, most of which look like HDFS paths, which leaves a lot of
> that file-size unexplained. What else is in there, and how exactly could
> this be happening?
> >
> > We're running 1.6.
> >
> > Jacob
>

Re: Very large _metadata file

Posted by Kostas Kloudas <kk...@gmail.com>.
Hi Jacob,

Could you specify which StateBackend you are using?

The reason I am asking is that, from the documentation in [1]:

"Note that if you use the MemoryStateBackend, metadata and savepoint
state will be stored in the _metadata file. Since it is
self-contained, you may move the file and restore from any location."

I am also cc'ing Gordon who may know a bit more about state formats.

I hope this helps,
Kostas

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/savepoints.html

On Wed, Mar 4, 2020 at 1:25 AM Jacob Sevart <js...@uber.com> wrote:
>
> Per the documentation:
>
> "The meta data file of a Savepoint contains (primarily) pointers to all files on stable storage that are part of the Savepoint, in form of absolute paths."
>
> I somehow have a _metadata file that's 1.9GB. Running strings on it I find 962 strings, most of which look like HDFS paths, which leaves a lot of that file-size unexplained. What else is in there, and how exactly could this be happening?
>
> We're running 1.6.
>
> Jacob