Posted to notifications@jclouds.apache.org by "Lari Sinisalo (JIRA)" <ji...@apache.org> on 2019/01/18 10:14:00 UTC

[jira] [Commented] (JCLOUDS-1488) Filesystem list call with prefix is slow in large containers

    [ https://issues.apache.org/jira/browse/JCLOUDS-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746126#comment-16746126 ] 

Lari Sinisalo commented on JCLOUDS-1488:
----------------------------------------

In org.jclouds.blobstore.config.LocalBlobStore.list(String, ListContainerOptions), there is the following code:

{code}
  // Loading blobs from container
  Iterable<String> blobBelongingToContainer = null;
  try {
     blobBelongingToContainer = storageStrategy.getBlobKeysInsideContainer(containerName);
  } catch (IOException e) {
     logger.error(e, "An error occurred loading blobs contained into container %s", containerName);
     propagate(e);
  }
{code}

getBlobKeysInsideContainer lists the keys of every blob in the container. It takes only the container name as a parameter, so it cannot take the prefix in the ListContainerOptions into account.

The getBlobKeysInsideContainer implementation in FilesystemStorageStrategyImpl is as follows:

{code}
   /**
    * Returns all the blobs key inside a container
    *
    * @param container
    * @return
    * @throws IOException
    */
   @Override
   public Iterable<String> getBlobKeysInsideContainer(String container) throws IOException {
      filesystemContainerNameValidator.validate(container);
      // check if container exists
      // TODO maybe an error is more appropriate
      Set<String> blobNames = Sets.newHashSet();
      if (!containerExists(container)) {
         return blobNames;
      }

      File containerFile = openFolder(container);
      final int containerPathLength = containerFile.getAbsolutePath().length() + 1;
      populateBlobKeysInContainer(containerFile, blobNames, new Function<String, String>() {
         @Override
         public String apply(String string) {
            return denormalize(string.substring(containerPathLength));
         }
      });
      return blobNames;
   }
{code}

The openFolder call here opens the container root directory. It seems that if this call received a subdirectory path derived from the prefix instead, the list call would be much more efficient, because only that subdirectory would have to be traversed.
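
To illustrate the idea, here is a minimal standalone sketch that uses plain java.nio instead of the jclouds classes. The class SubdirectoryListing and the method listKeys are hypothetical names, not part of jclouds, and the sketch skips the key normalization that FilesystemStorageStrategyImpl performs. It only shows that starting the walk at a subdirectory avoids touching the rest of the container:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of jclouds.
final class SubdirectoryListing {

   /**
    * Collects the keys of all regular files below startDir. Keys are
    * expressed relative to containerRoot and use '/' as the separator.
    * Both paths are assumed to be absolute, with startDir inside containerRoot.
    */
   static Set<String> listKeys(Path containerRoot, Path startDir) throws IOException {
      Set<String> keys = new TreeSet<>();
      if (!Files.isDirectory(startDir)) {
         // Nothing can match if the directory does not exist.
         return keys;
      }
      // Only startDir and its descendants are visited, not the whole container.
      try (Stream<Path> paths = Files.walk(startDir)) {
         paths.filter(Files::isRegularFile)
              .forEach(p -> keys.add(
                    containerRoot.relativize(p).toString().replace(File.separatorChar, '/')));
      }
      return keys;
   }
}
```

With the layout from the issue description, calling listKeys with the container root and its test-container-subdirectory directory would visit only that one directory, however many files sit elsewhere in the container.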

I am not sure what the appropriate way to extract the subdirectory path from the prefix would be. It would have to be done in a way that does not allow path traversal outside the container root directory. Passing the necessary information to getBlobKeysInsideContainer would also require interface changes.
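
One possible approach, shown here only as a sketch under assumptions: treat everything up to the last '/' of the prefix as a directory path, resolve it against the container root, normalize it, and reject the result if it no longer lies under the root. PrefixResolver and directoryForPrefix are hypothetical names, not existing jclouds API. The sketch does not resolve symbolic links and does not apply the name normalization used by the filesystem provider, so it is not a complete solution:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of jclouds.
final class PrefixResolver {

   /**
    * Returns the deepest directory under containerRoot that can contain
    * blobs whose key starts with the given prefix.
    *
    * @throws IllegalArgumentException if the prefix points outside containerRoot
    */
   static Path directoryForPrefix(Path containerRoot, String prefix) {
      Path root = containerRoot.toAbsolutePath().normalize();
      if (prefix == null || prefix.isEmpty()) {
         return root;
      }
      // Text after the last '/' is a partial blob name, not a directory.
      int lastSlash = prefix.lastIndexOf('/');
      if (lastSlash < 0) {
         return root;
      }
      // normalize() collapses "." and ".." segments before the check below.
      Path candidate = root.resolve(prefix.substring(0, lastSlash)).normalize();
      // Path.startsWith compares whole path elements, not raw characters.
      if (!candidate.startsWith(root)) {
         throw new IllegalArgumentException("Prefix escapes the container root: " + prefix);
      }
      return candidate;
   }

   public static void main(String[] args) {
      Path root = Paths.get("base-dir", "test-container");
      System.out.println(directoryForPrefix(root, "test-container-subdirectory/"));
      System.out.println(directoryForPrefix(root, "test-container-subdirectory/file-"));
   }
}
```

A prefix such as "../other-container/" normalizes to a path outside the root and is rejected. The prefix filter would still need to be applied to the returned keys afterwards, because a prefix like "test-container-subdirectory/file-" names a directory plus a partial file name.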

> Filesystem list call with prefix is slow in large containers
> ------------------------------------------------------------
>
>                 Key: JCLOUDS-1488
>                 URL: https://issues.apache.org/jira/browse/JCLOUDS-1488
>             Project: jclouds
>          Issue Type: Bug
>          Components: jclouds-blobstore
>    Affects Versions: 2.1.1
>         Environment: Java version: java version "1.8.0_131"
> Operating system: Fedora 27 x86_64
>            Reporter: Lari Sinisalo
>            Priority: Major
>              Labels: filesystem
>         Attachments: JCLOUDS1488.java
>
>
> When the filesystem blobstore is used, running the following code takes a very long time if there are a lot of files in the container:
> {code:java}
>     ListContainerOptions options = new ListContainerOptions();
>     options.prefix("test-container-subdirectory/");
>     Set<? extends StorageMetadata> results =
>       blobStore.list("test-container",options);
> {code}
> See the attached Java source file [^JCLOUDS1488.java] for the full code.
> On my system, running the attached Java code takes over 10 seconds to list a single file if there are 500,000 files in the container outside that prefix.
> Output from the attached code:
> {code:java}
> Number of blobs listed: 1
> First listed blob: test-container-subdirectory/file-to-list
> Time it took to list the blobs: 13256 ms
> {code}
> A more general version of this problem was reported previously in JCLOUDS-1371.


