Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/10/04 19:29:00 UTC

[jira] [Commented] (FLINK-7643) Configure FileSystems only once

    [ https://issues.apache.org/jira/browse/FLINK-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191887#comment-16191887 ] 

ASF GitHub Bot commented on FLINK-7643:
---------------------------------------

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/4776

    [FLINK-7643] [core] Rework FileSystem loading to use factories

    ## What is the purpose of the change
    
    This change reworks the loading and instantiation of File System objects (including file systems supported via Hadoop) to use factories. 
    
    This makes sure that configurations (Flink and possibly Hadoop) are loaded once (on TaskManager / JobManager startup) and file system instances are properly reused by scheme and authority.
    
    This change is also a prerequisite for an extensible file system loading mechanism via a service framework.
    
    ## Brief change log
    
      - The special-case configuration of the `FileSystem` class to set the "default file system scheme" is extended to a generic configuration call.
      - The directory of directly supported file systems is changed from classes (instantiated via reflection) to factories.
      - These factories are also configured when the `FileSystem` is configured.
      - The Hadoop file system factory loads the Hadoop configuration once when being configured and applies it to all subsequently instantiated file systems.
      - File systems supported via Hadoop are now properly cached and not reloaded, reinstantiated, and reconfigured on each access.
      - This also removes a lot of legacy code for locating Hadoop file system implementations.
      - The `FileSystem` class is much cleaner now because a lot of the Hadoop FS
      - All file systems now eagerly initialize their settings, rather than dividing that between the constructor and the `initialize()` method.
      - This also factors out a lot of the special treatment of Hadoop file systems and simply makes the Hadoop File System factory the default fallback factory.
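
The change log above can be illustrated with a minimal sketch. It is not the code from this pull request: every name in it (`FsRegistrySketch`, `SketchFs`, `SketchFsFactory`, `SimpleFactory`, the `fs.sketch.setting` key) is hypothetical. It only assumes what the list states: factories are registered per scheme, configured once, instances are cached by scheme and authority, and one factory (standing in for the Hadoop one) serves as the fallback for unknown schemes.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names and structure are assumptions, not Flink's actual API.
final class FsRegistrySketch {

    /** Stand-in for a file system instance, fully initialized in its constructor. */
    static final class SketchFs {
        final String scheme;
        final String authority;
        final String settings;

        SketchFs(String scheme, String authority, String settings) {
            this.scheme = scheme;
            this.authority = authority;
            this.settings = settings;
        }
    }

    /** A factory is configured once and then creates instances on demand. */
    interface SketchFsFactory {
        void configure(Map<String, String> config);

        SketchFs create(URI uri);
    }

    /** Simple factory that counts its calls so the caching behavior can be observed. */
    static final class SimpleFactory implements SketchFsFactory {
        final AtomicInteger configureCalls = new AtomicInteger();
        final AtomicInteger createCalls = new AtomicInteger();
        private String settings = "";

        @Override
        public void configure(Map<String, String> config) {
            configureCalls.incrementAndGet();
            settings = config.getOrDefault("fs.sketch.setting", "");
        }

        @Override
        public SketchFs create(URI uri) {
            createCalls.incrementAndGet();
            return new SketchFs(uri.getScheme(), uri.getAuthority(), settings);
        }
    }

    private final Map<String, SketchFsFactory> factories = new HashMap<>();
    private final Map<String, SketchFs> cache = new HashMap<>();
    private final SketchFsFactory fallback;

    FsRegistrySketch(SketchFsFactory fallback) {
        this.fallback = fallback;
    }

    synchronized void register(String scheme, SketchFsFactory factory) {
        factories.put(scheme, factory);
    }

    /** Intended to be called once at process startup; configures every factory. */
    synchronized void configure(Map<String, String> config) {
        for (SketchFsFactory factory : factories.values()) {
            factory.configure(config);
        }
        fallback.configure(config);
        cache.clear(); // instances created under an older configuration are dropped
    }

    /** Returns the cached instance for the URI's scheme and authority, creating it if needed. */
    synchronized SketchFs get(URI uri) {
        String scheme = uri.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + uri);
        }
        String key = scheme + "://" + String.valueOf(uri.getAuthority());
        SketchFs fs = cache.get(key);
        if (fs == null) {
            // Unknown schemes go to the fallback factory.
            SketchFsFactory factory = factories.getOrDefault(scheme, fallback);
            fs = factory.create(uri);
            cache.put(key, fs);
        }
        return fs;
    }
}
```

In this sketch, two lookups such as `hdfs://nn1:8020/a` and `hdfs://nn1:8020/b` share one instance, while `hdfs://nn2:8020/a` gets its own, and the configuration is read by each factory only when `configure` is called.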
    
    ## Verifying this change
    
    Reworked some tests to cover the behavior of this change:
      - `flink-core/src/test/java/org/apache/flink/configuration/FilesystemSchemeConfigTest.java`
      - `flink-runtime/src/test/java/org/apache/flink/runtime/taskmanager/TaskManagerConfigurationTest.java`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (yes / **no**)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (**yes** / no)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
    
    *Note:* The breaking changes made on `@Public` class `FileSystem` do not include methods that are meant for users, but only the setup configuration.
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (yes / **no**)
      - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink fs_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4776.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4776
    
----
commit ba312e137c7af1d2c331c5231b5b0ae3e0401549
Author: Stephan Ewen <se...@apache.org>
Date:   2017-10-02T12:34:27Z

    [FLINK-7643] [core] Misc. cleanups in FileSystem
    
      - Simplify access to local file system
      - Use a fair lock for all FileSystem.get() operations
      - Robust fallback to local fs for default scheme (avoids URI parsing error on Windows)
      - Deprecate 'getDefaultBlockSize()'
      - Deprecate create(...) with block sizes and replication factor, which is not applicable to many FS
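
As a rough illustration of the "fair lock" item above, the following sketch guards a lookup with a `java.util.concurrent.locks.ReentrantLock` created in fair mode. The class and method names are hypothetical and the cached values are plain placeholder objects; how `FileSystem.get()` actually uses its lock is not shown in this message.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; not the actual FileSystem implementation.
final class FairLockedCacheSketch {

    // Passing 'true' requests the fair ordering policy, which favors
    // granting the lock to the longest-waiting thread.
    private final ReentrantLock lock = new ReentrantLock(true);

    private final Map<String, Object> cache = new HashMap<>();

    /** Returns the cached object for the key, creating it under the lock if absent. */
    Object get(String key) {
        lock.lock();
        try {
            return cache.computeIfAbsent(key, k -> new Object());
        } finally {
            lock.unlock();
        }
    }

    boolean isFair() {
        return lock.isFair();
    }
}
```

A fair lock typically has lower throughput than the default non-fair mode, so this trades some speed for more predictable ordering among waiting threads.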

commit 8130d874b8b823f22964f435bf1a1d1bd39774d6
Author: Stephan Ewen <se...@apache.org>
Date:   2017-10-02T14:25:18Z

    [FLINK-7643] [core] Rework FileSystem loading to use factories
    
    This makes sure that configurations are loaded once and file system instances are
    properly reused by scheme and authority.
    
    This also factors out a lot of the special treatment of Hadoop file systems and simply
    makes the Hadoop File System factory the default fallback factory.

commit c652f1322044f9715a0d94fa21ec853769be9a78
Author: Stephan Ewen <se...@apache.org>
Date:   2017-10-02T14:30:07Z

    [FLINK-7643] [core] Drop eager checks for file system support.
    
    Some places validate if the file URIs are resolvable on the client. This leads to
    problems when file systems are not accessible from the client, when the full libraries for
    the file systems are not present on the client (for example often the case in cloud setups),
    or when the configuration on the client is different from the nodes/containers that will
    execute the application.

----


> Configure FileSystems only once
> -------------------------------
>
>                 Key: FLINK-7643
>                 URL: https://issues.apache.org/jira/browse/FLINK-7643
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.0
>            Reporter: Ufuk Celebi
>            Assignee: Stephan Ewen
>
> HadoopFileSystem always reloads GlobalConfiguration, which potentially leads to a lot of noise in the logs, because this happens on each checkpoint.
> Instead, file systems should be configured once upon process startup, when the configuration is loaded.
> This will also increase efficiency of checkpoints, as it avoids redundant parsing for each data chunk.


