ContentMD5 not working in the Storage Client library

by ingvar 28 June 2011 16:33

Edit: 16th February

This bug was fixed in the Windows Azure SDK v. 1.5 or v. 1.6. So go download the newest version!

 

Edit: 6th July

This is a bug and it has been filed with the team that's developing the Windows Azure Client Library. More information here.

 

Edit: 1st July - new version of this blog post:

I think my old version (see below) was not totally clear on how to reproduce what I was trying to say. So I have made a new code snippet that shows that the ContentMD5 blob property is not populated by the Storage Client library even if the MD5 value is in the REST result. Below is the code that shows it. To try the code, just copy/paste it into a console application and change the account, storageKey and containerName to match your settings.

I should also add that I'm not doing this to show off or talk down on Microsoft's work with Windows Azure. I think all the Azure stuff is awesome!! It is just that, if this is a bug and it gets fixed, it will make working with MD5 hash values on blobs a lot easier and faster. If I'm wrong I will take it all back :)

Here is the output of the code:

/md5test/MyBlob.txt MD5 = Not populated
MyBlob.txt MD5 = zhFORQHS9OLc6j4XtUbzOQ==

Here is the code:

string account = "ACCOUNT NAME";
string storageKey = "STORAGE KEY";
string containerName = "md5test";

/* ListBlobs done by storage library */
StorageCredentialsAccountAndKey credentials =
    new StorageCredentialsAccountAndKey(
        account, 
        storageKey);
CloudStorageAccount storageAccount = new CloudStorageAccount(credentials, false);
CloudBlobClient client = storageAccount.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference(containerName);
container.CreateIfNotExist();

string blobContent = "This is a test";
MD5 md5 = new MD5CryptoServiceProvider();
byte[] hashBytes = md5.ComputeHash(Encoding.UTF8.GetBytes(blobContent));
CloudBlob cloudBlob = container.GetBlobReference("MyBlob.txt");
cloudBlob.Properties.ContentMD5 = Convert.ToBase64String(hashBytes);
cloudBlob.UploadText(blobContent);

foreach (var blob in container.ListBlobs().OfType<CloudBlob>())
{
    Console.WriteLine(blob.Uri.LocalPath + " MD5 = " +
        (blob.Properties.ContentMD5 ?? "Not populated"));
}

/* ListBlobs done by REST */
DateTime now = DateTime.UtcNow;

string nowString = now.ToString("R", CultureInfo.InvariantCulture);

string canonicalizedResource = 
    string.Format("/{0}/{1}\ncomp:list\nrestype:container", 
    account, containerName);
string canonicalizedHeaders = 
    string.Format("x-ms-date:{0}\nx-ms-version:2009-09-19", 
    nowString);

string messageSignature =
    string.Format("GET\n\n\n\n\n\n\n\n\n\n\n\n{0}\n{1}",
    canonicalizedHeaders,
    canonicalizedResource);

byte[] SignatureBytes = Encoding.UTF8.GetBytes(messageSignature);
HMACSHA256 SHA256 = new HMACSHA256(Convert.FromBase64String(storageKey));
String authorizationHeader = 
    string.Format("SharedKey {0}:{1}", 
    account, 
    Convert.ToBase64String(SHA256.ComputeHash(SignatureBytes)));


Uri url = new Uri(
   string.Format("http://{0}.blob.core.windows.net/{1}?restype=container&comp=list",
   account, containerName));
HttpWebRequest request = HttpWebRequest.Create(url) as HttpWebRequest;
request.Method = "GET";
request.ContentLength = 0;
request.Headers.Add("x-ms-date", nowString);
request.Headers.Add("x-ms-version", "2009-09-19");
request.Headers.Add("Authorization", authorizationHeader);

HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream responseStream = response.GetResponseStream();
StreamReader reader = new StreamReader(responseStream);

string responseString = reader.ReadToEnd();

reader.Close();
responseStream.Close();
response.Close();

XElement responseElement = XElement.Parse(responseString);

foreach (XElement element in responseElement.Descendants("Content-MD5"))
{
    string name = element.Parent.Parent.Element("Name").Value;
    string md5value = element.Value;

    if (string.IsNullOrEmpty(md5value))
    {
        md5value = "Not populated";
    }

    Console.WriteLine(name + " MD5 = " + md5value);
}

 

 

 

Original version of this blog post

This is a follow-up on my other blog post about the ContentMD5 blob property (Why is MD5 not part of ListBlobs result?). It is motivated by the comment from Gaurav Mantri stating that the MD5 is actually in the REST response, it is just not populated by the Storage Client library.

So I did some more testing and found out that the MD5 is actually calculated and set by the blob storage when you upload the blob. When I did a simple test of creating a new container, uploading a simple 'Hello World' txt file and doing a ListBlobs through the REST API, I got the XML shown below. Note that I did not populate the ContentMD5 when I uploaded the file! If I do a ListBlobs method call on a CloudBlobContainer instance, the ContentMD5 is NOT populated (read more about this in my other blog post). So my best guess is that this is an error in the API of some sort.

It would be very nice if this was fixed, so we did not have to compute the MD5 hash by hand and add it to the blob's metadata. It would make code like my local folder to/from blob storage synchronization much simpler (read it here).


<EnumerationResults ContainerName="http://myblobtest.blob.core.windows.net/md5test">
 <Blobs>
  <Blob>
   <Name>MyTestBlob.xml</Name>
   <Url>http://myblobtest.blob.core.windows.net/md5test/MyTestBlob.xml</Url>
   <Properties>
   <Last-Modified>Tue, 28 Jun 2011 16:20:30 GMT</Last-Modified>
   <Etag>0x8CE038BC293943D</Etag>
   <Content-Length>12</Content-Length>
   <Content-Type>text/xml</Content-Type>
   <Content-Encoding/>
   <Content-Language/>
   <Content-MD5>LL0183lu9Ms2aq/eo4TesA==</Content-MD5>
   <Cache-Control/>
   <BlobType>BlockBlob</BlobType>
   <LeaseStatus>unlocked</LeaseStatus>
  </Properties>
  </Blob>
 </Blobs>
 <NextMarker/>
</EnumerationResults>

 

Here is the code I did for doing the REST call:

string requestUrl =
    "http://myblobtest.blob.core.windows.net/md5test?restype=container&comp=list";
string certificatePath = /* Path to certificate */;

HttpWebRequest httpWebRequest =
    (HttpWebRequest)HttpWebRequest.Create(new Uri(requestUrl, true));
httpWebRequest.ClientCertificates.Add(new X509Certificate2(certificatePath));
httpWebRequest.Headers.Add("x-ms-version", "2009-09-19");
httpWebRequest.ContentType = "application/xml";
httpWebRequest.ContentLength = 0;
httpWebRequest.Method = "GET";

using (HttpWebResponse httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse())
{
    using (Stream responseStream = httpWebResponse.GetResponseStream())
    {
        using (StreamReader reader = new StreamReader(responseStream))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

Tags:

.NET | Azure | Blob

Why is MD5 not part of ListBlobs result?

by ingvar 26 June 2011 20:13

Edit 6th July:

It turns out that this is a bug in the Windows Azure Client Library. Read more here.

 

Original version:

For some strange reason, the ContentMD5 blob property is only populated if you call FetchAttributes on the blob. This means that you cannot obtain the value of the ContentMD5 blob property by calling ListBlobs with any parameter settings. If you have to call FetchAttributes to obtain the value, it is not feasible to use on a large set of blobs, because all those extra requests are going to take a long time. This means that it cannot be used in an optimal way when doing something like the fast folder synchronization I wrote about in this blog post.

The ContentMD5 blob property is never set by the blob storage itself. And if you try to put any value in it that is not a correct 128-bit MD5 hash of the file, you will get an exception. So it has limited usage.

I have made some code to illustrate the missing population of the ContentMD5 blob property when using the ListBlobs method.

Here is the output of running the code:

GetBlobReference without FetchAttributes: Not populated
GetBlobReference with FetchAttributes: zhFORQHS9OLc6j4XtUbzOQ==
ListBlobs with BlobListingDetails.None: Not populated
ListBlobs with BlobListingDetails.Metadata: Not populated
ListBlobs with BlobListingDetails.All: Not populated

And here is the code:

CloudStorageAccount storageAccount = /* Initialize this */;
CloudBlobClient client = storageAccount.CreateCloudBlobClient();

CloudBlobContainer container = client.GetContainerReference("mytest");
container.CreateIfNotExist();

const string blobContent = "This is a test";

/* Compute 128 MD5 hash of content */
MD5 md5 = new MD5CryptoServiceProvider();
byte[] hashBytes = md5.ComputeHash(Encoding.UTF8.GetBytes(blobContent));

/* Get the blob reference and upload it with MD5 hash */
CloudBlob blob = container.GetBlobReference("MyBlob1.txt");
blob.Properties.ContentMD5 = Convert.ToBase64String(hashBytes);
blob.UploadText(blobContent);



CloudBlob blob1 = container.GetBlobReference("MyBlob1.txt");

/* Not populated - this is expected */
Console.WriteLine("GetBlobReference without FetchAttributes: " + 
    (blob1.Properties.ContentMD5 ?? "Not populated"));



blob1.FetchAttributes();

/* Populated - this is expected */
Console.WriteLine("GetBlobReference with FetchAttributes: " + 
    (blob1.Properties.ContentMD5 ?? "Not populated"));



CloudBlob blob2 = container.ListBlobs().OfType<CloudBlob>().Single();
            
/* Not populated - this is NOT expected */
Console.WriteLine("ListBlobs with BlobListingDetails.None: " + 
    (blob2.Properties.ContentMD5 ?? "Not populated"));



CloudBlob blob3 = container.ListBlobs(new BlobRequestOptions
    { BlobListingDetails = BlobListingDetails.Metadata }).
    OfType<CloudBlob>().Single();

/* Not populated - this is NOT expected */
Console.WriteLine("ListBlobs with BlobListingDetails.Metadata: " + 
    (blob3.Properties.ContentMD5 ?? "Not populated"));



CloudBlob blob4 = container.ListBlobs(new BlobRequestOptions
    {
        UseFlatBlobListing = true,
        BlobListingDetails = BlobListingDetails.All
    }).
    OfType<CloudBlob>().Single();

/* Not populated - this is NOT expected */
Console.WriteLine("ListBlobs with BlobListingDetails.All: " + 
    (blob4.Properties.ContentMD5 ?? "Not populated"));

Tags:

.NET | Azure | Blob | C#

How to do a fast recursive local folder to/from azure blob storage synchronization

by ingvar 23 June 2011 20:49

Introduction

In this blog post I will describe how to synchronize a local file system folder with a Windows Azure blob container/folder. There are many ways to do this, some faster than others. My way of doing it is especially fast if few files have been added/updated/deleted. If many have been added/updated/deleted it is still fast, but uploading/downloading files to/from the blob storage will be the main time factor. The algorithm I'm going to describe was developed when I was implementing a non-live-editing Windows Azure deployment model for Composite C1. You can read more about the setup here. I will do a more technical blog post about this non-live-editing setup later.

Breakdown of the problem

The algorithm should only do one-way synchronization, meaning that it either updates the local file system folder to match what's stored in the blob container, or updates the blob container to match what's stored in the local folder. So I will split it up into two algorithms: one for synchronizing to the blob storage and one for synchronizing from the blob storage.

Because the blob storage is located on another computer, we can't compare the time stamps of the blobs against the time stamps of local files. The reason is that the clocks of the two computers (the blob storage and our local machine) will never be 100% in sync. What we can do instead is use file hashes like MD5. The only problem with file hashes is that they are expensive to calculate, so we have to do this as little as possible. We can accomplish that by saving the MD5 hash in the blob's metadata and caching the hash of the local file in memory. Even if we convert the hash value to a base 64 string, holding the hashes in memory for 10,000 files will cost less than 0.3 megabytes. So this scales fairly well.
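
As a rough back-of-the-envelope check of that estimate (my own numbers, not measurements from this post, and the file path is just a placeholder), an MD5 hash is 16 bytes and Base64-encodes to a 24-character string:

MD5 md5 = new MD5CryptoServiceProvider();
byte[] hash = md5.ComputeHash(File.ReadAllBytes(@"C:\SomeFolder\SomeFile.txt"));
string base64Hash = Convert.ToBase64String(hash);

Console.WriteLine(hash.Length);       /* 16 bytes */
Console.WriteLine(base64Hash.Length); /* 24 characters */

/* Ten thousand of these short strings add up to no more than a few hundred */
/* kilobytes of memory, which is roughly the scale of the estimate above.   */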

When working with the Windows Azure blob storage we have to take care not to make lots of requests. In particular, we should take care not to make a request for every file/blob we process. Each request is likely to take more than 50 ms, and if we have 10,000 files to process, this will cost more than 8 minutes! So we should never use GetBlobReference/FetchAttributes to see if a blob exists and/or to get its MD5 hash. But this is no problem, because we can use the ListBlobs method with the right options.
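
To make the difference concrete, here is a minimal sketch (assuming container is an already initialized CloudBlobContainer and that each blob carries a "Hash" metadata entry, as in the algorithms below). The first loop pays one extra round trip per blob; the second gets the metadata as part of the listing itself:

/* Slow: one extra request per blob just to get its metadata */
foreach (CloudBlob blob in container.ListBlobs().OfType<CloudBlob>())
{
    blob.FetchAttributes();
    string hash = blob.Metadata["Hash"];
}

/* Fast: the metadata is included in the listing response */
var options = new BlobRequestOptions
{
    UseFlatBlobListing = true,
    BlobListingDetails = BlobListingDetails.Metadata
};
foreach (CloudBlob blob in container.ListBlobs(options).OfType<CloudBlob>())
{
    string hash = blob.Metadata["Hash"]; /* No extra request needed */
}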

Semi Pseudo Algorithms

Let's start with some semi pseudo code. I have left out some methods and properties, but they should be self-explanatory enough to give an overall understanding of the algorithms. I did this so it would be easier to read and understand. Further down I'll show the full C# code for these algorithms.

You might wonder why I store the MD5 hash value in the blob's metadata and not in the ContentMD5 property of the blob. The reason is that ContentMD5 is only populated with a value if FetchAttributes is called on the blob, which would make the algorithm perform really badly. I'll do a blog post on the odd behavior of the ContentMD5 blob property later. Edit: Read it here.

Download the full source here: BlobSync.cs (7.80 kb).

Synchronizing to the blob

public void SynchronizeToBlob()
{
    DateTime lastSync = LastSyncTime;
    DateTime newLastSyncTime = DateTime.Now;

    IEnumerable<string> allFilesInStartFolder = GetAllFilesInStartFolder();

    /* Metadata is needed in the listing so we can read the "Hash" values below */
    var blobOptions = new BlobRequestOptions
        { UseFlatBlobListing = true, BlobListingDetails = BlobListingDetails.Metadata };
            
    /* This is the only request to the blob storage that we will do */
    /* except when we have to upload or delete to/from the blob */
    var blobs =
        Container.ListBlobs(blobOptions).
        OfType<CloudBlob>().
        Select(b => new
        {
            Blob = b,
            LocalPath = GetLocalPath(b)
        }).
        ToList();
    /* We use ToList here to avoid multiple requests when enumerating */

    foreach (string filePath in allFilesInStartFolder)
    {
        string fileHash = GetFileHashFromCache(filePath, lastSync);

        /* Checking for added files */
        var blob = blobs.Where(b => b.LocalPath == filePath).SingleOrDefault();
        if (blob == null) // Does not exist
        {
            UploadToBlobStorage(filePath, fileHash);
            continue; /* Nothing more to check for a newly uploaded file */
        }

        /* Checking for changed files */
        if (fileHash != blob.Blob.Metadata["Hash"])
        {
            UploadToBlobStorage(filePath, fileHash, blob.Blob);
        }
    }

    /* Check for deleted files */
    foreach (var blob in blobs)
    {
        bool exists = allFilesInStartFolder.Where(f => blob.LocalPath == f).Any();

        if (!exists)
        {
            DeleteBlob(blob.Blob);
        }
    }

    LastSyncTime = newLastSyncTime;
}

 

Synchronizing from the blob

public void SynchronizeFromBlob()
{
    IEnumerable<string> allFilesInStartFolder = GetAllFilesInStartFolder();

    var blobOptions = new BlobRequestOptions
    {
        UseFlatBlobListing = true,
        BlobListingDetails = BlobListingDetails.Metadata
    };

    /* This is the only request to the blob storage that we will do */
    /* except when we have to upload or delete to/from the blob */
    var blobs =
        Container.ListBlobs(blobOptions).
        OfType<CloudBlob>().
        Select(b => new
        {
            Blob = b,
            LocalPath = GetLocalPath(b)
        }).
        ToList();
    /* We use ToList here to avoid multiple requests when enumerating */

    foreach (var blob in blobs)
    {
        /* Checking for added files */
        if (!File.Exists(blob.LocalPath))
        {
            DownloadFromBlobStorage(blob.Blob, blob.LocalPath);
        }

        /* Checking for changed files */
        string fileHash = GetFileHashFromCache(blob.LocalPath);
        if (fileHash != blob.Blob.Metadata["Hash"])
        {
            DownloadFromBlobStorage(blob.Blob, blob.LocalPath);
            UpdateFileHash(blob.LocalPath, blob.Blob.Metadata["Hash"]);
        }
    }

    /* Checking for deleted files */
    foreach (string filePath in allFilesInStartFolder)
    {
        bool exists = blobs.Where(b => b.LocalPath == filePath).Any();
        if (!exists)
        {
            File.Delete(filePath);
        }
    }
}

The rest of the code

In this section I will go through the missing methods and properties from the semi pseudo algorithms above. Most of them are pretty simple and self explaining, but a few of them are more complex and needs more attention. 

LastSyncTime and Container

These are just get/set properties. Container should be initialized with the blob container that you wish to synchronize to/from. LastSyncTime is initialized with DateTime.MinValue. LocalFolder points to the local directory to synchronize to/from. A sketch of how they could be initialized follows the declarations below.

private DateTime LastSyncTime { get; set; }      
private CloudBlobContainer Container { get; set; }
/* Ends with a \ */
private string LocalFolder { get; set; }
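
A minimal sketch of how these three could be initialized, reusing the Storage Client setup from the listing at the top of this post (the account name, key, container name and folder path below are placeholders, not values from the original code):

StorageCredentialsAccountAndKey credentials =
    new StorageCredentialsAccountAndKey("ACCOUNT NAME", "STORAGE KEY");
CloudStorageAccount storageAccount = new CloudStorageAccount(credentials, false);
CloudBlobClient client = storageAccount.CreateCloudBlobClient();

Container = client.GetContainerReference("synccontainer");
Container.CreateIfNotExist();

LocalFolder = @"C:\SyncFolder\"; /* Ends with a \ */
LastSyncTime = DateTime.MinValue;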

UploadToBlobStorage

Simply adds the file hash to the blob's metadata and uploads the file.

private void UploadToBlobStorage(string filePath, string fileHash)
{
    string blobPath = filePath.Remove(0, LocalFolder.Length);
    CloudBlob blob = Container.GetBlobReference(blobPath);
    blob.Metadata["Hash"] = fileHash;
    blob.UploadFile(filePath);
}


private void UploadToBlobStorage(string filePath, string fileHash, CloudBlob cloudBlob)
{
    cloudBlob.Metadata["Hash"] = fileHash;
    cloudBlob.UploadFile(filePath);
}

DownloadFromBlobStorage

Simply downloads the blob

private void DownloadFromBlobStorage(CloudBlob blob, string filePath)
{
    blob.DownloadToFile(filePath);
}

DeleteBlob

Simply deletes the blob

private void DeleteBlob(CloudBlob blob)
{
    blob.Delete();
}

GetFileHashFromCache

There are two versions of this method. This one is used when synchronizing to the blob. It uses the LastWriteTime of the file and the time of the last sync to skip calculating the file hash of files that have not been changed. This saves a lot of time, so it's worth the complexity.

private readonly Dictionary<string, string> _syncToBlobHashCache = new Dictionary<string, string>();
private string GetFileHashFromCache(string filePath, DateTime lastSync)
{
    if (File.GetLastWriteTime(filePath) <= lastSync && 
        _syncToBlobHashCache.ContainsKey(filePath))
    {
        return _syncToBlobHashCache[filePath];
    }
    else
    {
        using (FileStream file = new FileStream(filePath, FileMode.Open))
        {
            MD5 md5 = new MD5CryptoServiceProvider();

            string fileHash = Convert.ToBase64String(md5.ComputeHash(file));
            _syncToBlobHashCache[filePath] = fileHash;

            return fileHash;
        }
    }
}

GetFileHashFromCache and UpdateFileHash

This is the other version of the GetFileHashFromCache method. This one is used when synchronizing from the blob. The UpdateFileHash is used for updating the file hash cache when a new hash is obtained from a blob.

private readonly Dictionary<string, string> _syncFromBlobHashCache = new Dictionary<string, string>();
private string GetFileHashFromCache(string filePath)
{
    if (_syncFromBlobHashCache.ContainsKey(filePath))
    {
        return _syncFromBlobHashCache[filePath];
    }
    else
    {
        using (FileStream file = new FileStream(filePath, FileMode.Open))
        {
            MD5 md5 = new MD5CryptoServiceProvider();

            string fileHash = Convert.ToBase64String(md5.ComputeHash(file));
            _syncFromBlobHashCache[filePath] = fileHash;

            return fileHash;
        }
    }
}

private void UpdateFileHash(string filePath, string fileHash)
{
    _syncFromBlobHashCache[filePath] = fileHash;
}

GetAllFilesInStartFolder

This method returns all files in the start folder given by the LocalFolder property. It ToLowers all file paths. This is done because blob names are case sensitive, so when we compare paths returned from this method we want to compare all-lower-cased paths. When comparing paths we also use the GetLocalPath method, which translates a blob path into a local path and also ToLowers the result.

private IEnumerable<string> GetAllFilesInStartFolder()
{
    Queue<string> foldersToProcess = new Queue<string>();
    foldersToProcess.Enqueue(LocalFolder);

    while (foldersToProcess.Count > 0)
    {
        string currentFolder = foldersToProcess.Dequeue();
        foreach (string subFolder in Directory.GetDirectories(currentFolder))
        {
            foldersToProcess.Enqueue(subFolder);
        }

        foreach (string filePath in Directory.GetFiles(currentFolder))
        {
            yield return filePath.ToLower();
        }
    }
}

GetLocalPath

Returns the local path of the given blob, using the LocalFolder property as the base folder. It ToLowers the result so we only compare all-lower-cased paths, since blob names are case sensitive.

private string GetLocalPath(CloudBlob blob)
{
    /* Path should only use \ and no /  */
    string path = blob.Uri.LocalPath.Remove(0, blob.Container.Name.Length + 2).Replace('/', '\\');

    /* Blob names are case sensitive, so when we check local */
    /* filenames against blob names we ToLower all of it */
    return Path.Combine(LocalFolder, path).ToLower();
}

Tags:

.NET | Azure | C# | Blob

Azure cloud blob property vs metadata

by ingvar 5 June 2011 21:45

Both properties (some of them) and the metadata collection of a blob can be used to store metadata for a given blob. But there are small differences between them. When working with the blob storage, the number of HTTP REST requests plays a significant role when it comes to performance. The number of requests becomes very important if the blob storage contains a lot of small files. There are at least three properties found in the CloudBlob.Properties property that can be used freely. These are ContentType, ContentEncoding and ContentLanguage. They can hold very large strings! I have tried testing with a string containing 100,000 characters and it worked. They could possibly hold a lot more, but hey, 100,000 is a lot! So all three of them can be used to hold metadata.
So, what is the difference between using these properties and using the metadata collection? The difference lies in when they get populated. This is best illustrated by the following code:

CloudBlobContainer container; /* Initialized assumed */
CloudBlob blob1 = container.GetBlobReference("MyTestBlob.txt");
blob1.Properties.ContentType = "MyType";
blob1.Metadata["Meta"] = "MyMeta";
blob1.UploadText("Some content");

CloudBlob blob2 = container.GetBlobReference("MyTestBlob.txt");
string value21 = blob2.Properties.ContentType; /* Not populated */
string value22 = blob2.Metadata["Meta"]; /* Not populated */

CloudBlob blob3 = container.GetBlobReference("MyTestBlob.txt");
blob3.FetchAttributes();
string value31 = blob3.Properties.ContentType; /* Populated */
string value32 = blob3.Metadata["Meta"]; /* Populated */

CloudBlob blob4 = (CloudBlob)container.ListBlobs().First();
string value41 = blob4.Properties.ContentType; /* Populated */
string value42 = blob4.Metadata["Meta"]; /* Not populated */

BlobRequestOptions options = new BlobRequestOptions
  {
     BlobListingDetails = BlobListingDetails.Metadata
  };
CloudBlob blob5 = (CloudBlob)container.ListBlobs(options).First();
string value51 = blob5.Properties.ContentType; /* Populated */
string value52 = blob5.Metadata["Meta"]; /* populated */

 

The difference shows up when using ListBlobs on a container or blob directory, depending on the values of the BlobRequestOptions object. It might not seem like a big difference, but imagine that there are 10,000 blobs, all with a metadata string value 100 characters long. That sums to 1,000,000 extra characters to send when listing the blobs. So if the metadata is not used every time you do a ListBlobs call, you might consider moving it to the Metadata collection. I will investigate the performance of these ways of storing metadata for a blob in a later blog post.

Tags:

.NET | Azure | C# | Blob

About the author

Martin Ingvar Kofoed Jensen

Architect and Senior Developer at Composite on the open source project Composite C1 - C#/4.0, LINQ, Azure, Parallel and much more!
