Posted to dev@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2010/07/26 15:53:53 UTC

[jira] Commented: (JCR-2682) Allow the FileDataStore to scale over several millions of files

    [ https://issues.apache.org/jira/browse/JCR-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892309#action_12892309 ] 

Thomas Mueller commented on JCR-2682:
-------------------------------------

Hi,

Thanks for the patch! It is short and easy to understand. However, I have a few questions and remarks:

If scanning the directory takes so long, is this not also a problem for the data store garbage collection? How could that problem be solved?

With your patch, it's possible that the same binary is stored again each month. However, if you have one directory per month, this might not be a big problem.

What about using a slightly different algorithm instead: use three directories, "new", "old", and "old-marked". When storing a binary, the algorithm first needs to check in "old", "new", and possibly "old-marked" (if data store garbage collection is currently running) whether the entry already exists. If it does, the binary is left where it is (except if garbage collection is running, in which case the file is moved from "old" to "old-marked"). New entries are stored in "new".

When the backup is running, it only needs to back up the directory "new". After the backup is complete, all files are moved from "new" to "old" / "old-marked".

Data store garbage collection moves all used entries from "old" to "old-marked". After that, the directory "old" is deleted, and "old-marked" is renamed to "old".

When reading, the algorithm needs to check all directories and pick the first entry found.
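
The read side of the three-directory scheme sketched above could look as follows. This is a minimal illustration only: the class, the directory names, and the flat file layout inside each directory are assumptions based on the description, not actual Jackrabbit code.

```java
import java.io.File;

public class ThreeDirLookup {

    private final File root;

    public ThreeDirLookup(File root) {
        this.root = root;
    }

    /**
     * Checks "new", "old", and "old-marked" in order and returns the first
     * existing file for the given identifier, or null if none is found.
     */
    public File findFile(String identifier) {
        for (String dir : new String[] {"new", "old", "old-marked"}) {
            File candidate = new File(new File(root, dir), identifier);
            if (candidate.exists()) {
                return candidate;
            }
        }
        return null;
    }
}
```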


> Allow the FileDataStore to scale over several millions of files
> ---------------------------------------------------------------
>
>                 Key: JCR-2682
>                 URL: https://issues.apache.org/jira/browse/JCR-2682
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 1.6.2, 2.1.0
>         Environment: Linux (Red-Hat)
>            Reporter: Vincent Larchet
>
> In a project where we handle several million documents stored in Jackrabbit using the FileDataStore, we encountered issues related to the file system itself (ext3) and to our backup tool.
> The root cause is that having millions of files in the same file system is quite hard, and with the way files are stored (using directories built upon the file content's hash), the backup tool has to scan the whole table of contents to detect what has changed. In our case it takes approx. 2.5 hours to scan the 5+ million files.
> My idea was to be able to use several file systems mounted in the same FileDataStore and to declare some of them read-only (so the backup tool does not have to scan them to find new files).
> I made a working prototype by enhancing the FileDataStore with a new level at the top of the folder hierarchy. This folder changes with the document insertion date, with the granularity configured by a pattern (compatible with SimpleDateFormat) provided in the FileDataStore Spring configuration.
> Example:
> * if we specify 
> <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
>     [...]
>     <param name="prefixDatePattern" value="yyyy-MM" />
> </DataStore>
> * then a folder ${FileDataStore.path}/2010-07/ will be created this month, containing the usual 3-level folder hierarchy built from the content's hash
> * this allows mounting a dedicated file system on this folder: in our case (we do not modify existing data), next month (in August) this file system will be re-mounted read-only and the backup tool will just skip it most of the time
> NOTE: the implementation is 100% backward compatible: files stored with the current FileDataStore keep the way they are persisted, and it is possible to change the config without having to extract/re-import all previous files (of course, "old" documents will keep their "old" path on the hard drive)
> --------
> seems that I can't upload files, so here is the patch for the trunk (only FileDataStore is impacted):
> {code:title=FileDataStore.java}
> --- FileDataStore.2.0-orig.java	lun. juil. 19 15:50:13 2010
> +++ FileDataStore.java	lun. juil. 19 15:52:55 2010
> @@ -26,8 +26,10 @@
>  import java.security.MessageDigest;
>  import java.security.NoSuchAlgorithmException;
>  import java.sql.Timestamp;
> +import java.text.SimpleDateFormat;
>  import java.util.ArrayList;
>  import java.util.Collections;
> +import java.util.Date;
>  import java.util.Iterator;
>  import java.util.List;
>  import java.util.Map;
> @@ -85,6 +87,11 @@
>       * Must be at least 3 characters.
>       */
>      private static final String TMP = "tmp";
> +    
> +    /**
> +     * Separator used to distinguish the date part from the hash in the identifier.
> +     */
> +    private static final String DATE_SEP = "#";
>  
>      /**
>       * The minimum modified date. If a file is accessed (read or write) with a modified date
> @@ -105,6 +112,12 @@
>      private String path;
>  
>      /**
> +     * The date pattern to use as a prefix for directories in the repository. Set it to
> +     * null or an empty string to disable this feature.
> +     */
> +    private String prefixDatePattern;
> +
> +	/**
>       * The minimum size of an object that should be stored in this data store.
>       */
>      private int minRecordLength = DEFAULT_MIN_RECORD_LENGTH;
> @@ -116,6 +129,13 @@
>          Collections.synchronizedMap(new WeakHashMap<DataIdentifier, WeakReference<DataIdentifier>>());
>  
>      /**
> +     * Creates an uninitialized data store.
> +     *
> +     */
> +    public FileDataStore() {
> +    }
> +
> +    /**
>       * Initialized the data store.
>       * If the path is not set, &lt;repository home&gt;/repository/datastore is used.
>       * This directory is automatically created if it does not yet exist.
> @@ -199,7 +219,22 @@
>              } finally {
>                  output.close();
>              }
> -            DataIdentifier identifier = new DataIdentifier(digest.digest());
> +
> +            // Convert the digest to a hexadecimal string...
> +            String id = new DataIdentifier(digest.digest()).toString();
> +            
> +            // ... and prepend it with the current date if prefixDatePattern is set.
> +            String prefixDatePattern = getPrefixDatePattern();
> +            if (null != prefixDatePattern && !"".equals(prefixDatePattern)) {
> +            	try {
> +            		SimpleDateFormat sdf = new SimpleDateFormat(prefixDatePattern);
> +            		String prefixDate = sdf.format(new Date());
> +            		id = prefixDate + DATE_SEP + id;
> +            	} catch (IllegalArgumentException e) {
> +            		log.warn("Date pattern ["+prefixDatePattern+"] is incorrect. Ignoring the prefixDatePattern for FileDataStore.");
> +            	}
> +            }
> +            DataIdentifier identifier = new DataIdentifier(id);
>              File file;
>  
>              synchronized (this) {
> @@ -267,9 +302,16 @@
>          usesIdentifier(identifier);
>          String string = identifier.toString();
>          File file = directory;
> -        file = new File(file, string.substring(0, 2));
> -        file = new File(file, string.substring(2, 4));
> -        file = new File(file, string.substring(4, 6));
> +        int indexDate = string.indexOf(DATE_SEP);
> +        if (indexDate > -1) {
> +        	file = new File(file, string.substring(0, indexDate));
> +        	indexDate++; // To ignore the date separator
> +        } else {
> +        	indexDate = 0;
> +        }
> +        file = new File(file, string.substring(indexDate, indexDate+2));
> +        file = new File(file, string.substring(indexDate+2, indexDate+4));
> +        file = new File(file, string.substring(indexDate+4, indexDate+6));
>          return new File(file, string);
>      }
>  
> @@ -378,6 +420,28 @@
>          this.path = directoryName;
>      }
>  
> +    /**
> +     * Get the date pattern to use as a prefix for the data store repository.
> +     * 
> +     * @return the date pattern
> +     */
> +    public String getPrefixDatePattern() {
> +		return prefixDatePattern;
> +	}
> +
> +    /**
> +     * Set the date pattern to use as a prefix for the data store repository.
> +     * 
> +     * @param prefixDatePattern the date pattern
> +     */
> +	public void setPrefixDatePattern(String prefixDatePattern) {
> +		// We want to prevent the inclusion of the DATE_SEP character in the date prefix
> +		if (prefixDatePattern != null && prefixDatePattern.indexOf(DATE_SEP) > -1) {
> +			log.warn("Do not use the character ["+DATE_SEP+"] in your date pattern for FileDataStore!");
> +		}
> +		this.prefixDatePattern = prefixDatePattern == null ? null : prefixDatePattern.replaceAll(DATE_SEP, "");
> +	}
> +
>      public int getMinRecordLength() {
>          return minRecordLength;
>      }
> {code}
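
For illustration, the identifier/path mapping introduced by the patch can be mirrored in a standalone sketch. The class and method names here are illustrative, not part of Jackrabbit; only the substring logic follows the patch above.

```java
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DatePrefixedLayout {

    // Same separator as in the patch: splits the date prefix from the hex hash.
    private static final String DATE_SEP = "#";

    /** Prepends the formatted date to a hex identifier, e.g. "2010-07#abcdef...". */
    public static String prefixIdentifier(String hexId, String pattern, Date date) {
        return new SimpleDateFormat(pattern).format(date) + DATE_SEP + hexId;
    }

    /**
     * Resolves an (optionally date-prefixed) identifier to a file, as in the
     * patched getFile(): an optional date directory, then three two-character
     * hash levels, then the full identifier as the file name.
     */
    public static File getFile(File directory, String string) {
        File file = directory;
        int indexDate = string.indexOf(DATE_SEP);
        if (indexDate > -1) {
            file = new File(file, string.substring(0, indexDate));
            indexDate++; // skip the separator itself
        } else {
            indexDate = 0;
        }
        file = new File(file, string.substring(indexDate, indexDate + 2));
        file = new File(file, string.substring(indexDate + 2, indexDate + 4));
        file = new File(file, string.substring(indexDate + 4, indexDate + 6));
        return new File(file, string);
    }
}
```

With the pattern "yyyy-MM", an identifier "2010-07#abcdef..." resolves to ${path}/2010-07/ab/cd/ef/2010-07#abcdef..., while an unprefixed identifier keeps the legacy ${path}/ab/cd/ef/... layout.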

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.