How to do a fast recursive local folder to/from Azure blob storage synchronization

by ingvar 23. June 2011 20:49

Introduction

In this blog post I will describe how to synchronize a local file system folder with a Windows Azure blob container/folder. There are many ways to do this, some faster than others. My way of doing it is especially fast if only a few files have been added/updated/deleted. If many files have been added/updated/deleted it is still fast, but uploading/downloading files to/from the blob storage becomes the main time factor. I developed the algorithm described here while implementing a non-live-editing Windows Azure deployment model for Composite C1. You can read more about the setup here. I will do a more technical blog post about this non-live-editing setup later.

Breakdown of the problem

The algorithm only does one-way synchronization: it either updates the local file system folder to match what is stored in the blob container, or updates the blob container to match what is stored in the local folder. So I will split it into two algorithms, one for synchronizing to the blob storage and one for synchronizing from the blob storage.

Because the blob storage is located on another computer, we can't compare the timestamps of the blobs against the timestamps of the local files. The reason is that the clocks of the two computers (the blob storage and our local machine) will never be 100% in sync. What we can do instead is use file hashes like MD5. The only problem with file hashes is that they are expensive to calculate, so we have to do this as little as possible. We can accomplish that by saving the MD5 hash in the blob's metadata and caching the hash of the local file in memory. Even if we convert the hash value to a base64 string, holding the hashes of 10,000 files in memory will cost less than 0.3 megabytes. So this scales fairly well.
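As a quick sanity check on those numbers, here is a small sketch of mine (not part of the downloadable source) using the standard MD5 class from System.Security.Cryptography. A 16-byte MD5 digest becomes a 24-character base64 string, so a dictionary with 10,000 path/hash pairs stays well under 0.3 megabytes.

/* Minimal sketch: compute one MD5 hash and keep it as a base64 string.     */
/* The cached version of this logic is shown in GetFileHashFromCache below. */
private string ComputeMd5Base64(string filePath)
{
    using (FileStream file = File.OpenRead(filePath))
    using (MD5 md5 = new MD5CryptoServiceProvider())
    {
        /* 16 bytes -> a 24 character string such as "XrY7u+Ae7tCTyyK7j1rNww==" */
        return Convert.ToBase64String(md5.ComputeHash(file));
    }
}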

When working with the Windows Azure blob storage we have to take care not to make lots of requests. In particular, we should avoid making a request for every file/blob we process. Each request is likely to take more than 50 ms, and if we have 10,000 files to process, that adds up to more than 8 minutes! So we should never use GetBlobReference/FetchAttributes to check whether a blob exists and/or get its MD5 hash. This is no problem, though, because we can use the ListBlobs method with the right options.
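To make the contrast concrete, here is a short sketch of mine (not from the downloadable source) showing both approaches. It reuses the Container and LocalFolder properties and the GetAllFilesInStartFolder method defined further down in this post.

/* Slow: one round trip per file - at roughly 50 ms per request that is */
/* more than 8 minutes for 10,000 files.                                */
foreach (string filePath in GetAllFilesInStartFolder())
{
    string blobPath = filePath.Remove(0, LocalFolder.Length).Replace('\\', '/');
    CloudBlob slowBlob = Container.GetBlobReference(blobPath);
    slowBlob.FetchAttributes(); /* one extra request per blob, and it throws if the blob is missing */
}

/* Fast: a single flat listing that already carries each blob's metadata */
var listingOptions = new BlobRequestOptions
{
    UseFlatBlobListing = true,
    BlobListingDetails = BlobListingDetails.Metadata
};
List<CloudBlob> allBlobs = Container.ListBlobs(listingOptions).OfType<CloudBlob>().ToList();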

Semi-pseudo algorithms

Let's start with some semi-pseudo code. I have left out some methods and properties, but they should be self-explanatory enough to give an overall understanding of the algorithms. I did this to make the code easier to read and understand. Further down I'll show the full C# code for these algorithms.

You might wonder why I store the MD5 hash value in the blob's metadata and not in the ContentMD5 property of the blob. The reason is that ContentMD5 is only populated with a value if FetchAttributes is called on the blob, which would make the algorithm perform really badly. I'll cover the odd behavior of the ContentMD5 blob property in a later blog post. Edit: Read it here.

Download the full source here: BlobSync.cs (7.80 kb).

Synchronizing to the blob

public void SynchronizeToBlob()
{
    DateTime lastSync = LastSyncTime;
    DateTime newLastSyncTime = DateTime.Now;

    IEnumerable<string> allFilesInStartFolder = GetAllFilesInStartFolder().ToList();
    /* ToList so we only walk the local folder once */

    var blobOptions = new BlobRequestOptions
    {
        UseFlatBlobListing = true,
        /* Include metadata in the listing so we can read the "Hash" value below */
        BlobListingDetails = BlobListingDetails.Metadata
    };
            
    /* This is the only request to the blob storage that we will do */
    /* except when we have to upload or delete to/from the blob */
    var blobs =
        Container.ListBlobs(blobOptions).
        OfType<CloudBlob>().
        Select(b => new
        {
            Blob = b,
            LocalPath = GetLocalPath(b)
        }).
        ToList();
    /* We use ToList here to avoid multiple requests when enumerating */

    foreach (string filePath in allFilesInStartFolder)
    {
        string fileHash = GetFileHashFromCache(filePath, lastSync);

        /* Checking for added files */
        var blob = blobs.Where(b => b.LocalPath == filePath).SingleOrDefault();
        if (blob == null) // Does not exist
        {
            UploadToBlobStorage(filePath, fileHash);
            continue; /* A new file cannot also be a changed file */
        }

        /* Checking for changed files */
        if (fileHash != blob.Blob.Metadata["Hash"])
        {
            UploadToBlobStorage(filePath, fileHash, blob.Blob);
        }
    }

    /* Check for deleted files */
    foreach (var blob in blobs)
    {
        bool exists = allFilesInStartFolder.Where(f => blob.LocalPath == f).Any();

        if (!exists)
        {
            DeleteBlob(blob.Blob);
        }
    }

    LastSyncTime = newLastSyncTime;
}

 

Synchronizing from the blob

public void SynchronizeFromBlob()
{
    IEnumerable<string> allFilesInStartFolder = GetAllFilesInStartFolder();

    var blobOptions = new BlobRequestOptions
    {
        UseFlatBlobListing = true,
        BlobListingDetails = BlobListingDetails.Metadata
    };

    /* This is the only request to the blob storage that we will do */
    /* except when we have to upload or delete to/from the blob */
    var blobs =
        Container.ListBlobs(blobOptions).
        OfType<CloudBlob>().
        Select(b => new
        {
            Blob = b,
            LocalPath = GetLocalPath(b)
        }).
        ToList();
    /* We use ToList here to avoid multiple requests when enumerating */

    foreach (var blob in blobs)
    {
        /* Checking for added files */
        if (!File.Exists(blob.LocalPath))
        {
            DownloadFromBlobStorage(blob.Blob, blob.LocalPath);
            UpdateFileHash(blob.LocalPath, blob.Blob.Metadata["Hash"]);
            continue; /* No need to hash the file we just downloaded */
        }

        /* Checking for changed files */
        string fileHash = GetFileHashFromCache(blob.LocalPath);
        if (fileHash != blob.Blob.Metadata["Hash"])
        {
            DownloadFromBlobStorage(blob.Blob, blob.LocalPath);
            UpdateFileHash(blob.LocalPath, blob.Blob.Metadata["Hash"]);
        }
    }

    /* Checking for deleted files */
    foreach (string filePath in allFilesInStartFolder)
    {
        bool exists = blobs.Where(b => b.LocalPath == filePath).Any();
        if (!exists)
        {
            File.Delete(filePath);
        }
    }
}

The rest of the code

In this section I will go through the methods and properties that were missing from the semi-pseudo algorithms above. Most of them are pretty simple and self-explanatory, but a few are more complex and need more attention.

LastSyncTime and Container

These are just get/set properties. Container should be initialized with the blob container that you wish to synchronize to/from. LastSyncTime is initialized with DateTime.MinValue. LocalFolder points to the local directory to synchronize to/from.

private DateTime LastSyncTime { get; set; }      
private CloudBlobContainer Container { get; set; }
/* Ends with a \ */
private string LocalFolder { get; set; }
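To show how the pieces fit together, here is a hedged usage sketch. The class name BlobSync and its constructor are assumptions on my part (the post only shows the individual members and the downloadable BlobSync.cs); the CloudStorageAccount/CloudBlobClient calls are the standard storage client ones.

/* Hypothetical wiring code - the constructor is assumed, the member names follow this post */
CloudStorageAccount account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");
CloudBlobContainer container = account.CreateCloudBlobClient()
                                      .GetContainerReference("website-files");

BlobSync sync = new BlobSync(container, @"C:\WebsiteFiles\"); /* LocalFolder must end with a \ */

sync.SynchronizeToBlob();   /* push local changes to the container      */
sync.SynchronizeFromBlob(); /* or pull the container down to the folder */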

UploadToBlobStorage

Simply adds the file hash to the blob's metadata and uploads the file.

private void UploadToBlobStorage(string filePath, string fileHash)
{
    /* Blob names use / as separator, local paths use \ */
    string blobPath = filePath.Remove(0, LocalFolder.Length).Replace('\\', '/');
    CloudBlob blob = Container.GetBlobReference(blobPath);
    blob.Metadata["Hash"] = fileHash;
    blob.UploadFile(filePath);
}


private void UploadToBlobStorage(string filePath, string fileHash, CloudBlob cloudBlob)
{
    cloudBlob.Metadata["Hash"] = fileHash;
    cloudBlob.UploadFile(filePath);
}

DownloadFromBlobStorage

Simply downloads the blob to the given local path, creating the local directory first if it does not exist.

private void DownloadFromBlobStorage(CloudBlob blob, string filePath)
{
    /* Make sure the local (sub)directory exists before writing the file */
    Directory.CreateDirectory(Path.GetDirectoryName(filePath));
    blob.DownloadToFile(filePath);
}

DeleteBlob

Simply deletes the blob.

private void DeleteBlob(CloudBlob blob)
{
    blob.Delete();
}

GetFileHashFromCache

There are two versions of this method. This one is used when synchronizing to the blob. It uses the LastWriteTime of the file and the time of the last sync to skip calculating the hash of files that have not changed. This saves a lot of time, so it's worth the complexity.

private readonly Dictionary<string, string> _syncToBlobHashCache = new Dictionary<string, string>();
private string GetFileHashFromCache(string filePath, DateTime lastSync)
{
    if (File.GetLastWriteTime(filePath) <= lastSync && 
        _syncToBlobHashCache.ContainsKey(filePath))
    {
        return _syncToBlobHashCache[filePath];
    }
    else
    {
        using (FileStream file = new FileStream(filePath, FileMode.Open))
        using (MD5 md5 = new MD5CryptoServiceProvider()) /* dispose the hash provider */
        {
            string fileHash = Convert.ToBase64String(md5.ComputeHash(file));
            _syncToBlobHashCache[filePath] = fileHash;

            return fileHash;
        }
    }
}

GetFileHashFromCache and UpdateFileHash

This is the other version of the GetFileHashFromCache method. It is used when synchronizing from the blob. UpdateFileHash is used to update the file hash cache when a new hash is obtained from a blob.

private readonly Dictionary<string, string> _syncFromBlobHashCache = new Dictionary<string, string>();
private string GetFileHashFromCache(string filePath)
{
    if (_syncFromBlobHashCache.ContainsKey(filePath))
    {
        return _syncFromBlobHashCache[filePath];
    }
    else
    {
        using (FileStream file = new FileStream(filePath, FileMode.Open))
        using (MD5 md5 = new MD5CryptoServiceProvider()) /* dispose the hash provider */
        {
            string fileHash = Convert.ToBase64String(md5.ComputeHash(file));
            _syncFromBlobHashCache[filePath] = fileHash;

            return fileHash;
        }
    }
}

private void UpdateFileHash(string filePath, string fileHash)
{
    _syncFromBlobHashCache[filePath] = fileHash;
}

GetAllFilesInStartFolder

This method returns all files in the start folder given by the LocalFolder property. It lower-cases all file paths. This is done because blob names are case-sensitive, so when we compare paths returned from this method we only want to compare lower-cased paths. When comparing paths we also use the GetLocalPath method, which translates a blob path to a local path and also lower-cases the result.

private IEnumerable<string> GetAllFilesInStartFolder()
{
    Queue<string> foldersToProcess = new Queue<string>();
    foldersToProcess.Enqueue(LocalFolder);

    while (foldersToProcess.Count > 0)
    {
        string currentFolder = foldersToProcess.Dequeue();
        foreach (string subFolder in Directory.GetDirectories(currentFolder))
        {
            foldersToProcess.Enqueue(subFolder);
        }

        foreach (string filePath in Directory.GetFiles(currentFolder))
        {
            yield return filePath.ToLower();
        }
    }
}

GetLocalPath

Returns the local path of the given blob, using the LocalFolder property as the base folder. It lower-cases the result so that we only compare lower-cased paths, because blob names are case-sensitive.

private string GetLocalPath(CloudBlob blob)
{
    /* Strip the leading "/<container name>/" and use \ instead of / */
    string path = blob.Uri.LocalPath.Substring(blob.Container.Name.Length + 2).Replace('/', '\\');

    /* Blob names are case sensitive, so when we check local */
    /* filenames against blob names we lower case all of it  */
    return Path.Combine(LocalFolder, path).ToLower();
}

Tags:

.NET | Azure | C# | Blob

Comments (9)

SMDB (United Kingdom) - 7/19/2011 4:00:02 PM

Code not working for me. Tried a small test (sync from blob to file) using local storage. Getting error: access to the path 'c:\devstore' is denied.

ingvar (Denmark) - 7/19/2011 8:14:12 PM

@SMDB At what point do you get that exception?

SMDB (United Kingdom) - 7/20/2011 10:14:16 AM

blob.DownloadToFile(filePath); It is here (inside the DownloadFromBlobStorage method) that I get the error message.

ingvar (Denmark) - 10/4/2011 9:10:13 AM

Seems that you do not have local permissions to write files to c:\devstore when executing your program.

Yaron Levi (Israel) - 10/4/2011 2:48:12 AM

"We use ToList here to avoid multiple requests when enumerating" ... Are you sure that the ToList is needed here? It seems like only one call is made to the blob storage anyway. I've posted a question about it:

stackoverflow.com/.../cloudblobcontainer-listblobs-using-tolist-to-reduce-transaction-cost

What do you think?

ingvar (Denmark) - 10/4/2011 9:09:31 AM

Your question on Stack Overflow is not entirely the same as what I'm talking about in my blog post. In my code I execute this statement:

var blob = blobs.Where(b => b.LocalPath == filePath).SingleOrDefault();

multiple times, and if blobs had not been created with ToList, a REST request would be executed to fetch the blobs every time this line runs. I tested this using Fiddler. Does this explain the difference?

bhuvan bhatt (India) - 4/15/2014 8:44:06 AM

Are you writing this code in VB.NET, or on what platform?

ingvar (Denmark) - 4/15/2014 9:47:42 AM

C# / .NET

bhuvan bhatt (India) - 4/15/2014 10:10:09 AM

OK, thanks. I have started this in VS 2013 Ultimate.


