Ground-Up with the Bot Framework

It seems I can’t write about code these days without a warmup rant. So feel free to jump directly to the next section if you like. But where’s the fun in that?

My mixed (ok negative) feelings about “quickstarts” go back all the way to the invention of “Wizards” at Microsoft in the early 1990s. They serve a worthy goal, guiding users through a complex process to deliver value quickly. But even in those earliest days, it was clear that the reality was little more than a cheap dopamine hit, mostly good for demos and maybe helping show what’s possible. The problem comes down to two (IMNSHO) fatal flaws:

First, quickstarts abandon users deep in the jungle with a great SUV but no map or driver’s license. Their whole reason to exist is to avoid annoying details and optionality, but that means that the user has no understanding of the context in which the solution was created. How do you change it? What dependencies does it require? How does it fit into your environment? Does it log somewhere? Is it secured appropriately for production? How much will it cost to run? The end result is that people constantly put hacked-up versions of “Hello World” into production and pay for it later when they have no idea what is really going on.

Second, they make developers even lazier than they naturally are anyways. Rather than start with the basics, quickstarts skip most of the hard stuff and lock in decisions that any serious user will have to make for themselves. If this was the start of the documentation, that’d be fine — but it’s usually the end. Instead of more context, the user just gets dropped unceremoniously into auto-generated references that don’t provide any useful narrative. Even worse, the existence of the quickstart becomes an excuse for a sloppy underlying interface design (whether that’s an API or menus and dialogs) — e.g., why worry about the steering wheel if people take the test-drive using autopilot?

Anyways, this is really just a long-winded way to say that the Bot Framework quickstart is pretty useless, especially if you’re using Java. Let’s do better, shall we?

What is the Bot Framework?

There are a bunch of SDKs and builders out there for creating chatbots. The Microsoft Bot Framework has been around for a while (launched out of Microsoft Research in 2016) and seems to have pretty significant mindshare. Actually, the real momentum seems to be with no-code or low-code options, which makes sense given how many bots are shallow marketing plays — but I’m jumping right into the SDK here because that’s way more fun, and it’s my blog.

The framework is basically a big normalizer. Your bot presents a standardized HTTPS interface, using the Bot Framework SDK to help manage the various structures and state. The Azure Bot Service acts as a hub, translating messages in and out of various channels (Teams, Slack, SMS, etc.) and presenting them to your interface. Honestly, that’s basically the whole thing. There are additional services to support language understanding and speech-to-text and stuff like that, but it’s all additive to the basic framework.

WumpusBot and RadioBot

I introduced WumpusBot in my last post … basically a chatbot that lets you play a version of the classic 1970s game Hunt the Wumpus. The game logic is adapted from a simplified version online and lives in Wumpus.java, but I won’t spend much time on that. I’ve hooked WumpusBot up to Twilio SMS, so you can give it a try by texting “play” to 706-943-3865.

The project also contains RadioBot, a second chatbot that knows how to interact with the Shutdown Radio service I’ve talked about before. This one is hooked up to Microsoft Teams and includes some slightly fancier interactions — I’ll talk about that after we get a handle on the basics.

Build Housekeeping

All this is hosted in an Azure Function App — so let’s start there. The code is on github. You’ll need git, mvn and a JDK. Build like this:

git clone https://github.com/seanno/shutdownhook.git
cd shutdownhook/toolbox
mvn clean package install
cd ../radio/azure
mvn clean package

To run you’ll need two Cosmos Containers (details in Shutdown Radio on Azure, pay attention to the Managed Identity stuff) and a local.settings.json file with the keys COSMOS_ENDPOINT, COSMOS_DATABASE, COSMOS_CONTAINER and COSMOS_CONTAINER_WUMPUS. You should then be able to run locally using “mvn azure-functions:run”.

Getting a little ahead of myself, but to deploy to Azure you’ll need to update the “functionAppName” setting in pom.xml; “mvn azure-functions:deploy” should work from there assuming you’re logged into the Azure CLI.

The Endpoint

Your bot needs to expose an HTTPS endpoint that receives JSON messages via POST. The Java SDK would really like you to use Spring Boot for this, but it 100% isn’t required. I’ve used a standard Azure Function for mine; that code lives in Functions.java. It really is this simple:

  1. Deserialize the JSON in the request body into an Activity object (line 68).
  2. Pull out the “authorization” header (careful of case-sensitivity) sent by the Bot Framework (line 71).
  3. Get an instance of your “bot” (line 52). This is the message handler and derives from ActivityHandler in WumpusBot.java.
  4. Get an instance of your “adapter.” This is basically the framework engine; we inherit ours from BotFrameworkHttpAdapter in Adapter.java.
  5. Pass all the stuff from steps 1, 2 and 3 to the processIncomingActivity method of your Adapter (line 74).
  6. Use the returned InvokeResponse object to send an HTTPS status and JSON body back down the wire.

All of which is to say, “receive some JSON, do a thing, send back some JSON.” Wrapped up in a million annoying Futures.
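
To make that concrete, here is a rough sketch of the shape of such an endpoint. Package names and exact signatures are my approximations from memory rather than a copy of Functions.java, and the adapter and bot are passed into the constructor just to keep the sketch short (the real code manages them as singletons):

import java.util.Optional;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.HttpMethod;
import com.microsoft.azure.functions.HttpRequestMessage;
import com.microsoft.azure.functions.HttpResponseMessage;
import com.microsoft.azure.functions.HttpStatus;
import com.microsoft.azure.functions.annotation.AuthorizationLevel;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.HttpTrigger;
import com.microsoft.bot.builder.Bot;
import com.microsoft.bot.builder.InvokeResponse;
import com.microsoft.bot.integration.BotFrameworkHttpAdapter;
import com.microsoft.bot.schema.Activity;

public class BotEndpointSketch {

    private final BotFrameworkHttpAdapter adapter; // step 4: the framework engine
    private final Bot bot;                         // step 3: your ActivityHandler

    public BotEndpointSketch(BotFrameworkHttpAdapter adapter, Bot bot) {
        this.adapter = adapter;
        this.bot = bot;
    }

    @FunctionName("wumpus")
    public HttpResponseMessage wumpus(
        @HttpTrigger(name = "req", methods = { HttpMethod.POST },
                     authLevel = AuthorizationLevel.ANONYMOUS)
        HttpRequestMessage<Optional<String>> request,
        ExecutionContext context) throws Exception {

        // 1. Deserialize the request body into an Activity (the real code may use
        //    the SDK's own serialization helpers rather than a bare ObjectMapper)
        Activity activity = new ObjectMapper()
            .readValue(request.getBody().orElse(""), Activity.class);

        // 2. Pull out the authorization header (watch the casing)
        String authHeader = request.getHeaders().getOrDefault("authorization", "");

        // 5. Hand everything to the adapter and (sigh) wait on the Future
        InvokeResponse invoke =
            adapter.processIncomingActivity(authHeader, activity, bot).join();

        // 6. Send the status and JSON body back down the wire; a fuller version
        //    would propagate invoke.getStatus() instead of assuming 200
        return request.createResponseBuilder(HttpStatus.OK)
            .header("Content-Type", "application/json")
            .body(invoke.getBody())
            .build();
    }
}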

The Adapter

The BotAdapter acts as ringmaster for the “do a thing” part of the request, providing helpers and context for your Bot implementation.

BotFrameworkHttpAdapter is almost sufficient to use as-is; the only reason I needed to extend it was to provide a custom Configuration object. By default, the object looks for configuration information in a properties file. This isn’t a bad assumption for Java apps, but in Azure Functions it’s way easier to keep configuration in the environment (via local.settings.json during development and the “Configuration” blade in the portal for production). EnvConfiguration in Adapter.java handles this, and then is wired up to our Adapter at line 34.

The adapter uses its configuration object to fetch the information used in service-to-service authentication. When we register our Bot with the Bot Service, we get an application id and secret. The incoming authentication header (#2 above) is compared to the “MicrosoftAppId” and “MicrosoftAppSecret” values in the configuration to ensure the connection is legitimate.

Actually, EnvConfiguration is more complicated than would normally be required, because I wanted to host two distinct bots within the same Function App (WumpusBot and RadioBot). This requires a way to keep multiple AppId and AppSecret values around, but we only have one System.getenv() to work with. The “configSuffix” noise in my class takes care of that segmentation.
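
The lookup itself is simple. Something like this captures the idea; it’s an illustrative sketch rather than the real class (which also implements the SDK’s Configuration interface), and the fallback-to-unsuffixed behavior is my own guess at reasonable semantics:

public class EnvConfigurationSketch {

    private final String configSuffix; // e.g. "_wumpus", or "" for the default bot

    public EnvConfigurationSketch(String configSuffix) {
        this.configSuffix = (configSuffix == null ? "" : configSuffix);
    }

    // "MicrosoftAppId" becomes "MicrosoftAppId_wumpus" for WumpusBot, falling
    // back to the unsuffixed name if the suffixed one isn't set
    public String getProperty(String key) {
        String value = System.getenv(key + configSuffix);
        return (value != null) ? value : System.getenv(key);
    }
}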

There are a few other “providers” you can attach to your adapter if needed. The most common of these is the “AuthenticationProvider” that helps manage user-level OAuth, for example if you want your bot to access a user’s personal calendar or send email on their behalf. I didn’t have any need for this, so left the defaults alone.

Once you get all this wired up, you can pretty much ignore it.

The Bot

Here’s where the fun stuff starts. The Adapter sets up a TurnContext object and passes it to the onTurn method of your Bot implementation. The default onTurn handler is really just a big switch on the ActivityType (MESSAGE, TYPING, CONVERSATION_UPDATE, etc.) that farms out calls to type-specific handlers. Your bot can override any of these to receive notifications on various events.

The onMessageActivity method is called whenever your bot receives a (duh) message. For simple text interactions, simply call turnContext.getActivity().getText() to read the incoming text, and turnContext.sendActivity(MessageFactory.text(responseString)) to send back a response.
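
Concretely, a minimal message handler looks something like this; a generic echo bot rather than WumpusBot itself, but the same shape (signatures approximated from the Java SDK):

import java.util.concurrent.CompletableFuture;

import com.microsoft.bot.builder.ActivityHandler;
import com.microsoft.bot.builder.MessageFactory;
import com.microsoft.bot.builder.TurnContext;

public class EchoBot extends ActivityHandler {

    @Override
    protected CompletableFuture<Void> onMessageActivity(TurnContext turnContext) {
        String incoming = turnContext.getActivity().getText();

        // sendActivity returns a Future (of course it does); collapse it to Void
        return turnContext
            .sendActivity(MessageFactory.text("You said: " + incoming))
            .thenApply(resourceResponse -> null);
    }
}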

The Bot Framework has tried to standardize on markdown formatting for text messages, but support is spotty. For example, Teams and WebChat work well, but Skype and SMS just display messages as raw text. Get used to running into this a lot — normalization across channels is pretty hit or miss, so for anything complex you can expect to be writing channel-specific code. This goes for conversation semantics as well. For example, from my experience so far, the onMembersAdded activity:

  • Is called in Teams right away when the bot enters a channel or a new member joins;
  • Is called in WebChat only after the bot receives an initial message from the user; and
  • Is never called for Twilio SMS conversations at all.

Managing State

Quirks aside, for a stateless bot, that’s really about all there is to it. But not all bots are stateless — some of the most useful functionality emerges from a conversation that develops over time (even ELIZA needed a little bit of memory!). To accomplish that you’ll use the significantly over-engineered “BotState” mechanism you see in use at WumpusBot.java line 57. There are three types of state, each scoped differently:

  • UserState, scoped to a user across all of their conversations;
  • ConversationState, scoped to a single conversation and shared by everyone in it; and
  • PrivateConversationState, scoped to a single user within a single conversation.

All of these are the same except for the implementation of getStorageKey, which grovels around in the turnContext to construct an appropriate key to identify the desired scope.

The state object delegates actual storage to an implementation of a CRUD interface. The framework implements two versions, one in-memory and one using Cosmos DB. The memory one is another example of why quickstarts are awful — it’s easy, but is basically never appropriate for the real world. It’s just a shortcut to make the framework look simpler than it really is.

The Cosmos DB implementation is fine except that it authenticates using a key. I wanted to use the same Managed Identity I used elsewhere in this app already, so I implemented my own in Storage.java. I cheated a little by ignoring “ETag” support to manage versioning conflicts, but I just couldn’t make myself believe that this was going to be a problem. (Fun fact: Cosmos lets you create items with illegal id values, but then you can’t ever read or delete them without some serious hackage. That’s why safeKey exists.)

Last and very important if you’re implementing your own Storage — notice the call to enableDefaultTyping on the Jackson ObjectMapper. Without this setting, the ObjectMapper serializes to JSON without type information. This is often OK because you’re either providing the type directly or the OM can infer reasonably. But the framework’s state map is polymorphic (it holds Objects), so these mechanisms can’t do the job. Default typing stores type info in the JSON so you get back what you started with.
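
Here is a tiny self-contained illustration of what that setting buys you. Note that enableDefaultTyping is deprecated in newer Jackson releases in favor of activateDefaultTyping, but the effect is the same; GameState here is just a stand-in POJO:

import java.util.HashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

public class DefaultTypingDemo {

    // trivial stand-in for a state POJO like WumpusState
    public static class GameState {
        public int moveCount = 3;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        mapper.enableDefaultTyping(); // embed type info for Object-typed values

        Map<String, Object> state = new HashMap<>();
        state.put("game", new GameState());

        String json = mapper.writeValueAsString(state);

        // With default typing the JSON carries the class name, so "game" comes
        // back as a GameState; without it you'd just get a LinkedHashMap of fields.
        Map<String, Object> restored = mapper.readValue(json, HashMap.class);

        System.out.println(restored.get("game").getClass().getSimpleName()); // GameState
    }
}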

Once you have picked your scope and set up Storage, you can relatively easily fetch and store state objects (in my situation a WumpusState) with this pattern:

  1. Allocate a BotState object in your Bot singleton (line 39).
  2. Call getProperty in your activity handler to set up a named property (line 57).  
  3. Fetch the state using the returned StatePropertyAccessor and (ugh) wait on the Future (lines 58-60). Notice the constructor here which is used to initialize the object on first access.  
  4. Use the object normally.
  5. Push changes back to storage before exiting your handler (line 68). Change tracking is implicit, so be sure to update state in the specific object instance you got in step #3. This is why Wumpus.newGame() never reallocates a WumpusState once it’s attached.
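
Stitched together, the pattern looks roughly like this. It’s a sketch, not WumpusBot’s actual code: it assumes a UserState scope, approximates the SDK signatures from memory, and uses a stand-in state class with a made-up moveCount field:

import java.util.concurrent.CompletableFuture;

import com.microsoft.bot.builder.StatePropertyAccessor;
import com.microsoft.bot.builder.TurnContext;
import com.microsoft.bot.builder.UserState;

public class StatePatternSketch {

    // stand-in for the real WumpusState, just to make the sketch self-contained
    public static class WumpusState {
        private int moveCount;
        public int getMoveCount() { return moveCount; }
        public void setMoveCount(int value) { moveCount = value; }
    }

    private final UserState userState; // step 1: allocated once, over your Storage

    public StatePatternSketch(UserState userState) {
        this.userState = userState;
    }

    public CompletableFuture<Void> handleTurn(TurnContext turnContext) {
        // step 2: set up a named property
        StatePropertyAccessor<WumpusState> accessor =
            userState.createProperty("WumpusState");

        // step 3: fetch (or create on first access), waiting on the Future
        return accessor.get(turnContext, WumpusState::new)
            .thenCompose(state -> {
                // step 4: mutate the instance you were handed (change tracking is implicit)
                state.setMoveCount(state.getMoveCount() + 1);
                // step 5: push changes back before exiting the handler
                return userState.saveChanges(turnContext);
            });
    }
}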

Testing your Bot Locally

Once you have your Function App running and responding to incoming messages, you can test it out locally using the Bot Framework Emulator. The Emulator is a GUI that can run under Windows, Mac or Linux (in X). You provide your bot’s endpoint URL (e.g., http://localhost:7071/wumpus for the WumpusBot running locally with mvn azure-functions:run) and the app establishes a conversation that includes a bunch of nifty debugging information.

Connecting to the Bot Service

The emulator is nice because you can manage things completely locally. Testing with the real Bot Service gets a little more complicated, because the service needs to reach your bot at an Internet-accessible endpoint.

All of the docs and tutorials have you do this by running yet another random tool. ngrok is admittedly kind of cute — it basically just forwards a port from your local machine to a random url like https://92832de0.ngrok.io. The fact that it can serve up HTTPS is a nice bonus. So if you’re down for that, by all means go for it. But I was able to do most of my testing with the emulator, so by the time I wanted to see it live, I really just wanted to see it live. Deploying the function to Azure is easy and relatively quick, so I just did that and ended up with my real bot URL: https://shutdownradio.azurewebsites.net/wumpus.

The first step is to create the Bot in Azure. Search the portal for “Azure Bot” (it shows up in the Marketplace section). Give your bot a unique handle (I used “wumpus”) and pick your desired subscription and resource group (fair warning — most of this can be covered under your free subscription plan, but you might want to poke around to be sure you know what you’re getting into). Java bots can only be “Multi Tenant” so choose that option and let the system create a new App ID.

Once creation is complete, paste your bot URL into the “Messaging Endpoint” box. Next, copy down the “Microsoft App Id” value and click “Manage” and then “Certificates & secrets.” Allocate a new client secret since you can’t see the value of the one they created for you (doh). Back in the “Configuration” section of your Function app, add these values (remember my comment about “configSuffix” at the beginning of all this):

  • MicrosoftAppId_wumpus (your app id)
  • MicrosoftAppSecret_wumpus (your app secret)
  • MicrosoftAppType_wumpus (“MultiTenant” with no space)

If you want to run RadioBot as well, repeat all of this for a new bot using the endpoint /bot and without the “_wumpus” suffixes in the configuration values.

Congratulations, you now have a bot! In the Azure portal, you can choose “Test in Web Chat” to give it a spin. It’s pretty easy to embed this chat experience into your web site as well (instructions here).

You can use the “Channels” tab to wire up your bot to additional services. I hooked Wumpus up to Twilio SMS using the instructions here. In brief:

  • Sign up for Twilio and get an SMS number.
  • Create a “TwiML” application on their portal and link it to the Bot Framework using the endpoint https://sms.botframework.com/api/sms.
  • Choose the Twilio channel in the Azure portal and paste in your TwiML application credentials.

That’s it! Just text “play” to 706-943-3865 and you’re off to the races.

Bots in Microsoft Teams

Connecting to Teams is conceptually similar to SMS, just a lot more fiddly.

First, enable the Microsoft Teams channel in your Bot Service configuration. This is pretty much just a checkbox and confirmation that this is a Commercial, not Government, bot.

Next, bop over to the Teams admin site at https://admin.teams.microsoft.com/ (if you’re not an admin you may need a hand here). Under “Teams Apps” / “Setup Policies” / “Global”, make sure that the “Upload custom apps” slider is enabled. Note if you want to be more surgical about this, you can instead add a new policy with this option just for developers and assign it to them under “Manage Users.”

Finally, head over to https://dev.teams.microsoft.com/apps and create a new custom app. There are a lot of options here, but only a few are required:

  • Under “Basic Information”, add values for the website, privacy policy and terms of use. Any URL is fine for now, but they can’t be empty, or you’ll get mysterious errors later.
  • Under “App Features”, add a “Bot.” Paste your bot’s “Microsoft App Id” (the same one you used during the function app configuration) into the “Enter a Bot ID” box. Also check whichever of the “scope” checkboxes are interesting to you (I just checked them all).

Save all this and you’re ready to give it a try. If you want a super-quick dopamine hit, just click the “Preview in Teams” button. If you want to be more official about it, choose “Publish” / “Publish to org” and then ask your Teams Admin to approve the application for use. If you’re feeling really brave, you can go all-in and publish your bot to the Teams Store for anyone to use, but that’s beyond my pay grade here. Whichever way you choose to publish, once the app is in place you can start a new chat with your bot by name, or add them to a channel by typing @ and selecting “Get Bots” in the resulting popup. Pretty cool!

A caveat about using bots in channels: your bot will only receive messages in which it is @mentioned, which can be slightly annoying but net-net probably makes sense. Unfortunately though, this is probably going to mess up your message parsing, because the mention is included in the message text (e.g., “<at>botname</at> real message.”). I’ve coded RadioBot to handle this by stripping out anything between “at” markers at line 454. Just another way in which you really do need to know what channel you’re dealing with.
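
If you just want the gist, a regex strip along these lines gets you most of the way (a simplification of what RadioBot actually does, not its literal code):

public class MentionHelper {

    // strip Teams-style <at>...</at> mention markup before parsing the command
    public static String stripMentions(String messageText) {
        if (messageText == null) return "";
        return messageText.replaceAll("<at>.*?</at>", "").trim();
    }
}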

Teams in particular has a whole bunch of other capabilities and restrictions beyond what you’ll find in the vanilla Bot Framework. It’s worth reading through their documentation and in particular being aware of the Teams-specific stuff you’ll find in TeamsChannelData.

We made it!

Well that was a lot; kind of an anti-quickstart. But if you’ve gotten this far, you have a solid understanding of how the Bot Framework works and how the pieces fit together, start to finish. There is a bunch more we could dig into (for instance check out the Adaptive Card interfaces in RadioBot here and here) — but we don’t need to, because you’ll be able to figure it out for yourself. Teach a person to fish or whatever, I guess.

Anyhoo, if you do anything cool with this stuff, I’d sure love to hear about it, and happy to answer questions if you get stuck as well. Beyond that, I hope you’ll enjoy some good conversations with our future robot overlords, and I’ll look forward to checking in with another post soon!

Shutdown Radio on Azure

Back about a year ago when I was playing with ShutdownRadio, I ranted a bit about my failed attempt to implement it using Azure Functions and Cosmos. Just to recap, dependency conflicts in the official Microsoft Java libraries made it impossible to use these two core Azure technologies together — so I punted. I planned to revisit an Azure version once Microsoft got their sh*t together, but life moved on and that never happened.

Separately, a couple of weeks ago I decided I should learn more about chatbots in general and the Microsoft Bot Framework in particular. “Conversational” interfaces are popping up more and more, and while they’re often just annoyingly obtuse, I can imagine a ton of really useful applications. And if we’re ever going to eliminate unsatisfying jobs from the world, bots that can figure out what our crazily imprecise language patterns mean are going to have to play a role.

No joke, this is what my Bellevue workbench looks like right now, today.

But heads up, this post isn’t about bots at all. You know that thing where you want to do a project, but you can’t do the project until the workbench is clean, but you can’t clean up the workbench until you finish the painting job sitting on the bench, but you can’t finish that job until you go to the store for more paint, but you can’t go to the store until you get gas for the car? Yeah, that’s me.

My plan was to write a bot for Microsoft Teams that could interact with ShutdownRadio and make it more natural/engaging for folks that use Teams all day for work anyways. But it seemed really silly to do all of that work in Azure and then call out to a dumb little web app running on my ancient Rackspace VM. So that’s how I got back to implementing ShutdownRadio using Azure Functions. And while it was generally not so bad this time around, there were enough gotchas that I thought I’d immortalize them for Google here before diving into the shiny new fun bot stuff. All of which is to say — this post is probably only interesting to you if you are in fact using Google right now to figure out why your Azure code isn’t working. You have been warned.

A quick recap of the app

The idea of ShutdownRadio is for people to be able to curate and listen to (or watch I suppose) YouTube playlists “in sync” from different physical locations. There is no login and anyone can add videos to any channel — but there is also no list of channels, so somebody has to know the channel name to be a jack*ss. It’s a simple, bare-bones UX — the only magic is in the synchronization that ensures everybody is (for all practical purposes) listening to the same song at the same time. I talked more about all of this in the original article, so won’t belabor it here.

For your listening pleasure, I did migrate over the “songs by bands connected in some way to Seattle” playlist that my colleagues at Adaptive put together in 2020. Use the channel name “seattle” to take it for a spin; there’s some great stuff in there!

Moving to Azure Functions

The concept of Azure Functions (or AWS Lambda) is pretty sweet — rather than deploying code to servers or VMs directly, you just upload “functions” (code packages) to the cloud, configure the endpoints or “triggers” that allow users to execute them (usually HTTP URLs), and let your provider figure out where and how to run everything. This is just one flavor of the “serverless computing” future that is slowly but surely becoming the standard for everything (and of course there are servers, they’re just not your problem). ShutdownRadio exposes four of these functions:

  • /home simply returns the static HTML page that embeds the video player and drives the UX. Easy peasy.
  • /channel returns information about the current state of a channel, including the currently-playing video.
  • /playlist returns all of the videos in the channel.
  • /addVideo adds a new video to the channel.

Each of these routes was originally defined in Handlers.java as HttpHandlers, the construct used by the JDK internal HttpServer. After creating the Functions project using the “quickstart” maven archetype, lifting these over to Azure Functions in Functions.java was pretty straightforward. The class names are different, but the story is pretty much the same.
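
For flavor, here is roughly what one of those routes looks like on the Functions side. This is just the shape, not the real Functions.java; the query parameter name and the hard-coded body are placeholders:

import java.util.Optional;

import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.HttpMethod;
import com.microsoft.azure.functions.HttpRequestMessage;
import com.microsoft.azure.functions.HttpResponseMessage;
import com.microsoft.azure.functions.HttpStatus;
import com.microsoft.azure.functions.annotation.AuthorizationLevel;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.HttpTrigger;

public class ChannelFunctionSketch {

    @FunctionName("channel")
    public HttpResponseMessage channel(
        @HttpTrigger(name = "req", methods = { HttpMethod.GET },
                     authLevel = AuthorizationLevel.ANONYMOUS, route = "channel")
        HttpRequestMessage<Optional<String>> request,
        ExecutionContext context) {

        // the real handler looks up channel state in Cosmos; this is just the shape
        String channelName = request.getQueryParameters().getOrDefault("channel", "seattle");
        String json = "{ \"channel\": \"" + channelName + "\" }";

        return request.createResponseBuilder(HttpStatus.OK)
            .header("Content-Type", "application/json")
            .body(json)
            .build();
    }
}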

Routes and Proxies

My goal was to make minimal changes to the original code — obviously these handlers needed to change, as well as the backend store (which we’ll discuss later), but beyond that I wanted to leave things alone as much as possible. By default Azure Functions prepends “/api/” to HTTP routes, but I was able to match the originals by turfing that in the host.json configuration file:

"extensions": {
       "http": {
             "routePrefix": ""
       }
}

A trickier routing issue was getting the “root” page to work (i.e., “/” instead of “/home”). Functions are required to have a non-empty name, so you can’t just use “” (or “/”, yes I tried). It took a bunch of digging but eventually Google delivered the goods in two parts:

  1. Function apps support “proxy” rules via proxies.json that can be abused to route requests from the root to a named function (note the non-obvious use of “localhost” in the backendUri value to proxy routes to the same application).
  2. The maven-resources-plugin can be used in pom.xml to put proxies.json in the right place at packaging time so that it makes it up to the cloud.

Finally, the Azure portal “TLS/SSL settings” panel can be used to force all requests to use HTTPS. Not necessary for this app but a nice touch.

All of this seems pretty obscure, but for once I’m inclined to give Microsoft a break. Functions really aren’t meant to implement websites — they have Azure Web Apps and Static Web Apps for that. In this case, I just preferred the Functions model — so the weird configuration is on me.

Moving to Cosmos

I’m a little less sanguine about the challenges I had changing the storage model from a simple directory of files to Cosmos DB. I mean, the final product is really quite simple and works well, so that’s cool. But once again I ran into lazy client library issues and random inconsistencies all along the way.

There are a bunch of ways to use Cosmos, but at heart it’s just a super-scalable NoSQL document store. Honestly I don’t really understand the pedigree of this thing — back in the day “Cosmos” was the in-house data warehouse used to do analytics for Bing Search, but that grew up super-organically with a weird, custom batch interface. I can’t imagine that the public service really shares code with that dinosaur, but as far as I can tell it’s not a fork of any of the big open source NoSQL projects either. So where did it even come from — ground up? Yeesh, only at Microsoft.

Anyhoo, after creating a Cosmos “account” in the Azure portal, it’s relatively easy to create databases (really just namespaces) and containers within them (more like what I would consider databases, or maybe big flexible partitioned tables). Containers hold items which natively are just JSON documents, although they can be made to look like table rows or graph elements with the different APIs.

Access using a Managed Identity

One of the big selling points (at least for me) of using Azure for distributed systems is its support for managed identities. Basically each service (e.g., my Function App) can have its own Active Directory identity, and this identity can be given rights to access other services (e.g., my Cosmos DB container). These relationships completely eliminate the need to store and manage service credentials — everything just happens transparently without any of the noise or risk that comes with traditional service-to-service authentication. It’s beautiful stuff.

Of course, it can be a bit tricky to make this work on dev machines — e.g., the Azure Function App emulator doesn’t know squat about managed identities (it has all kinds of other problems too but let’s focus here). The best (and I think recommended?) approach I’ve found is to use the DefaultAzureCredentialBuilder to get an auth token. The pattern works like this:

  1. In the cloud, configure your service to use a Managed Identity and grant access using that.
  2. For local development, grant your personal Azure login access to test resources — then use “az login” at the command-line to establish credentials on your development machine.
  3. In code, let the DefaultAzureCredential figure out what kind of token is appropriate and then use that token for service auth.

The DefaultAzureCredential iterates over all the various and obtuse authentication types until it finds one that works — with production-class approaches like ManagedIdentityCredential taking higher priority than development-class ones like AzureCliCredential. Net-net it just works in both situations, which is really nice.
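
In code, that boils down to something like this (a sketch assuming the azure-identity and azure-cosmos client libraries; the endpoint value is whatever you keep in configuration):

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.identity.DefaultAzureCredential;
import com.azure.identity.DefaultAzureCredentialBuilder;

public class CosmosConnectSketch {

    public static CosmosClient connect(String cosmosEndpoint) {
        // Locally this falls through to the Azure CLI credential (after "az login");
        // deployed, it picks up the Function App's Managed Identity automatically.
        DefaultAzureCredential credential = new DefaultAzureCredentialBuilder().build();

        return new CosmosClientBuilder()
            .endpoint(cosmosEndpoint)   // e.g. the COSMOS_ENDPOINT setting
            .credential(credential)
            .buildClient();
    }
}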

Unfortunately, admin support for managed identities (or really any role-based access) with Cosmos is just stupid. There is no way to set it up using the portal — you can only do it via the command line with the Azure CLI or Powershell. I’ve said it before, but this kind of thing drives me absolutely nuts — it seems like every implementation is just random. Maybe it’s here, maybe it’s there, who knows … it’s just exhausting and inexcusable for a company that claims to love developers. But whatever, here’s a snippet that grants an AD object read/write access to a Cosmos container:

az cosmosdb sql role assignment create \
    --account-name 'COSMOS_ACCOUNT' \
    --resource-group 'COSMOS_RESOURCE_GROUP' \
    --scope '/dbs/COSMOS_DATABASE/colls/COSMOS_CONTAINER' \
    --principal-id 'MANAGED_IDENTITY_OR_OTHER_AD_OBJECT' \
    --role-definition-id '00000000-0000-0000-0000-000000000002'

The role-definition id there is a built-in CosmosDB “contributor” role that grants read and write access. The “scope” can be omitted to grant access to all databases and containers in the account, or just truncated to /dbs/COSMOS_DATABASE for all containers in the database. The same command can be used with your Azure AD account as the principal-id.

Client Library Gotchas

Each Cosmos Container can hold arbitrary JSON documents — they don’t need to all use the same schema. This is nice because it meant I could keep the “channel” and “playlist” objects in the same container, so long as they all had unique identifier values. I created this identifier by adding an internal “id” field on each of the objects in Model.java — the analog of the unique filename suffix I used in the original version.

The base Cosmos Java API lets you read and write POJOs directly using generics and the serialization capabilities of the Jackson JSON library. This is admittedly cool — I use the same pattern often with Google’s Gson library. But here’s the rub — the library can’t serialize common types like the ones in the java.time namespace. In and of itself this is fine, because Jackson provides a way to add serialization modules to do the job for unknown types. But the recommended way of doing this requires setting values on the ObjectMapper used for serialization, and that ObjectMapper isn’t exposed by the client library for public use. Well technically it is, so that’s what I did — but it’s a hack using stuff inside the “implementation” namespace:

log.info("Adding JavaTimeModule to Cosmos Utils ObjectMapper");
com.azure.cosmos.implementation.Utils.getSimpleObjectMapper().registerModule(new JavaTimeModule());

Side note: long after I got this working, I stumbled onto another approach that uses Jackson annotations and doesn’t require directly referencing private implementation. That’s better, but it’s still a crappy, leaky abstraction that requires knowledge and exploitation of undocumented implementation details. Do better, Microsoft!

Pop the Stack

Minor tribulations aside, ShutdownRadio is now happily running in Azure — so mission accomplished for this post. And when I look at the actual code delta between this version and the original one, it’s really quite minimal. Radio.java, YouTube.java and player.html didn’t have to change at all. Model.java took just a couple of tweaks, and I could have even avoided those if I were being really strict with myself. Not too shabby!

Now it’s time to pop this task off of the stack and get back to the business of learning about bots. Next stop, ShutdownRadio in Teams …and maybe Skype if I’m feeling extra bold. Onward!

Cold Storage on Azure

As the story goes, Steve Jobs once commissioned a study to determine the safest long-term storage medium for Apple’s source code. After evaluating all the most advanced technologies of the day (LaserDisc!), the team ended up printing it all out on acid-free paper to be stored deep within the low-humidity environment at Yucca Mountain — the theory being that our eyes were the most likely data retrieval technology to survive the collapse of civilization. Of course this is almost certainly false, but I love it anyways. Just like the (also false) Soviet-pencils-in-space story, there is something very Jedi about simplicity outperforming complexity. If you need me, I’ll be hanging out in the basement with Mike Mulligan and Mary Anne.


Anyways, I was reminded of the Jobs story the other day because long-term data storage is something of a recurring challenge in the Nolan household. In the days of hundreds of free gigs from consumer services, you wouldn’t think this would be an issue, and yet it is. In particular my wife takes a billion pictures (her camera takes something like fifty shots for every shutter press), and my daughter has created an improbable tidal wave of video content.

Keeping all this stuff safe has been a decades-long saga including various server incarnations, a custom-built NAS in the closet, the usual online services, and more. They all have fatal flaws, from reliability to cost to usability. Until very recently, the most effective approach was a big pile of redundant drives in a fireproof safe. It’s honestly not a terrible system; you can get 2TB for basically no money these days, so keeping multiple copies of everything isn’t a big deal. Still not great though — mean time to failure for both spinning disks and SSD remains sadly low — so we need to remember to check them all a couple of times each year to catch hardware failures. And there’s always more. A couple of weeks ago, as my daughter’s laptop was clearly on the way out, she found herself trying to rescue yet more huge files that hadn’t made it to the safe.

Enter Glacier (and, uh, “Archive”)

It turns out that in the last five years or so a new long-term storage approach has emerged, and it is awesome.

Object (file) storage has been a part of the “cloud” ever since there was a “cloud” — Amazon calls their service S3; Microsoft calls theirs Blob Storage.  Conceptually these systems are quite simple: files are uploaded to and downloaded from virtual drives (“buckets” for Amazon, “containers” for Azure) using more-or-less standard web APIs. The files are available to anyone anywhere that has the right credentials, which is super-handy. But the real win is that files stored in these services are really, really unlikely to be lost due to hardware issues. Multiple copies of every file are stored not just on multiple drives, but in multiple regions of the world — so they’re good even if Lex Luthor does manage to cleave off California into the ocean (whew). And they are constantly monitored for hardware failure behind the scenes. It’s fantastic.

But as you might suspect, this redundancy doesn’t come for free. Storing a 100 gigabyte file in “standard” storage goes for about $30 per year (there are minor differences between services and lots of options that can impact this number, but it’s reasonably close), which is basically the one-and-done cost of a 2 terabyte USB stick! This premium can be very much worth it for enterprises, but it’s hard to swallow for home use.

Ah, but wait. These folks aren’t stupid, and realized that long-term “cold” storage is its own beast. Once stored, these files are almost never looked at again — they just sit there as a security blanket against disaster. By taking them offline (even just by turning off the electricity to the racks), they could be stored much more cheaply, without sacrificing any of the redundancy. The tradeoff is only that if you do need to read the files, bringing them back online takes some time (about half a day generally) — not a bad story for this use case. Even better, the teams realized that they could use the same APIs for both “active” and “cold” file operations — and even move things between these tiers automatically to optimize costs in some cases.

Thus was born Amazon Glacier and the predictably-boringly-named Azure Archive Tier. That same 100GB file in long-term storage costs just $3.50 / year … a dramatically better cost profile, and something I can get solidly behind for family use. Woo hoo!

But Wait

The functionality is great, and the costs are totally fine. So why not just let the family loose on some storage and be done with it? As we often discover, the devil is in the user experience. Both S3 and Blob Service are designed as building blocks for developers and IT nerds — not for end users. The native admin tools are a non-starter; they exist within an uber-complex web of cloud configuration tools that make it very easy to do the wrong thing. There are a few hideously-complicated apps that all look like 1991 FTP clients. And there are a few options for using the services to manage traditional laptop backups, but they all sound pretty sketchy and that’s not our use case here anyways.

Sounds like a good excuse to write some code! I know I’m repeating myself but … whether it’s your job or not, knowing how to code is the twenty-first century superpower. Give it a try.

The two services are basically equivalent; I chose to use Azure storage because our family is already deep down the Microsoft rabbit hole with Office365. And this time I decided to bite the bullet and deploy the user-facing code using Azure as well — in particular an Azure Static Web App using the Azure Storage Blob client library for JavaScript. You can create a “personal use” SWA for free, which is pretty sweet. Unfortunately, Microsoft’s shockingly bad developer experience strikes again and getting the app to run was anything but “sweet.” At its height my poor daughter was caught up in a classic remote-IT-support rodeo, which she memorialized in true Millennial Meme form.

Anyhoo — the key features of an app to support our family use case were pretty straightforward:

  1. Simple user experience, basically a “big upload button”.
  2. Login using our family Office365 accounts (no new passwords).
  3. A segregated personal space for each user’s files.
  4. An “upload” button to efficiently push files directly into the Archive tier.
  5. A “thaw” button to request that a file be copied to the Cool tier so it can be downloaded.
  6. A “download” button to retrieve thawed files.
  7. A “delete” button to remove files from either tier.

One useful feature I skipped — given that the “thawing” process can take about fifteen hours, it would be nice to send an email notification when that completes. I haven’t done this yet, but Azure does fire events automatically when rehydration is complete — so it’ll be easy to add later.

For the rest of this post, we’ll decisively enter nerd-land as I go into detail about how I implemented each of these. Not a full tutorial, but hopefully enough to leave some Google crumbs for folks trying to do similar stuff. All of the code is up on github in its own repository; feel free to use any of it for your own purposes — and let me know if I can help with anything there.

Set up the infrastructure

All righty. First you’ll need an Azure Static Web App. SWAs are typically deployed directly from github; each time you check in, the production website will automatically be updated with the new code. Set up a repo and the Azure SWA using this quickstart (use the personal plan). Your app will also need managed APIs — this quickstart shows how to add and test them on your local development machine. These quickstarts both use Visual Studio Code extensions — it’s definitely possible to do all of this without VSCode, but I don’t recommend it. Azure developer experience is pretty bad; sticking to their preferred toolset at least minimizes unwelcome surprises.

You’ll also need a Storage Account, which you can create using the Azure portal. All of the defaults are reasonable, just be sure to pick the “redundancy” setting you want (probably “Geo-redundant storage”). Once the account has been created, add a CORS rule (in the left-side navigation bar) that permits calls from your SWA domain (you’ll find this name in the “URL” field of the overview page for the SWA in the Azure portal).

Managing authentication with Active Directory

SWAs automatically support authentication using accounts from Active Directory, Github or Twitter (if you choose the “standard” pricing plan you can add your own). This is super-nice and reason alone to use SWA for these simple little sites — especially for my case where the users in question are already part of our Azure AD through Office365. Getting it to work correctly, though, is a little tricky.

Code in your SWA can determine the users’ logged-in status in two ways: (1) from the client side, make an Ajax call to the built-in route /.auth/me, which returns a JSON object with information about the user, including their currently-assigned roles; (2) from API methods, decode the x-ms-client-principal header to get the same information.

By default, all pages in a SWA are open for public access and the role returned will be “anonymous”. Redirecting a user to the built-in route /.auth/aad will walk them through a standard AD login experience. By default anyone with a valid AD account can log in and will be assigned the “authenticated” role. If you’re ok with that, then good enough and you’re done. If you want to restrict your app only to specific users (as I did), open up the Azure portal for your SWA and click “Role management” in the left-side navigation bar. From here you can “invite” specific users and grant them custom roles (I used “contributor”) — since only these users will have your roles, you can filter out the riff-raff.

Next you have to configure routes in the file staticwebapp.config.json in the same directory with your HTML files to enforce security. There’s a lot of ways to do this and it’s a little finicky because your SWA has some hidden routes that you don’t want to accidentally mess with. My file is here; basically it does four things:

  1. Allows anyone to view the login-related pages (/.auth/*).
  2. Restricts the static and api files to users that have my custom “contributor” role.
  3. Redirects “/” to my index.html page.
  4. Redirects to the AD auth page when needed to prompt login.

I’m sure there’s a cleaner way to make all this happen, but this works and makes sense to me, so onward we go.

Displaying files in storage

The app displays files in two tables: one for archived files (in cold storage) and one for active ones that are either ready to download or pending a thaw. Generating the actual HTML for these tables happens on the client, but the data is assembled at the server. The shared “Freezer” object knows how to name the user’s personal container from their login information and ensure it exists. The listFiles method then calls listBlobsFlat to build the response object.

There are more details on the “thawing” process below, but note that if a blob is in the middle of thawing we identify it using the “archiveStatus” property on the blob. Other than that, this is a pretty simple iteration and transformation. I have to mention again just how handy JSON is these days — it’s super-easy to cons up objects and return them from API methods.

Uploading

Remember the use case here is storing big files — like tens to hundreds of gigabytes big. Uploading things like that to the cloud is a hassle no matter how you do it, and browsers in particular are not known for their prowess at the job. But we’re going to try it anyways.

In the section above, the browser made a request to our own API (/api/listFiles), which in turn made requests to the Azure storage service. That works fine when the data packages are small, but when you’re trying to push a bunch of bytes, having that API “middleman” just doesn’t cut it. Instead, we want to upload the file directly from the browser to Azure storage. This is why we had to set up a CORS rule for the storage account, because otherwise the browser would reject the “cross-domain” request to https://STORAGE_ACCT.blob.core.windows.net where the files live.


The same client library that we’ve been using from the server (node.js) environment will work in client-side JavaScript as well — sort of. Of course because it’s a Microsoft client library, they depend on about a dozen random npm packages (punycode, tough-cookie, universalify, the list goes on), and getting all of this into a form that the browser can use requires a “bundler.” They actually have some documentation on this, but it leaves some gaps — in particular, how best to use the bundled files as a library. I ended up using webpack to make the files, with a little index.js magic to expose the stuff I needed. It’s fine, I guess.

The upload code lives here in index.html. The use of a hidden file input is cute but not essential — it just gives us a little more control over the ux. Of course, calls to storage methods need to be authenticated; our approach is to ask our server to generate a “shared access signature” (SAS) token tied to the blob we’re trying to upload — which happens in freezer.js (double-duty for upload and download). The authenticated URL we return is tied only to that specific file, and only for the operations we need.

The code then calls the SDK method BlockBlobClient.uploadData to actually push the data. This is the current best option for uploading from the browser, but to find it you have to make your way there through a bunch of other methods that are either gone, deprecated or only work in the node.js runtime. The quest is worthwhile, though, because there is some good functionality tucked away in there that is key for large uploads:

  • Built in retries (we beef this up with retryOptions).
  • Clean cancel using an AbortController.
  • A differentiated approach for smaller files (upload in one shot) vs. big ones (upload in chunks).
  • When uploading in chunks, parallel upload channels to maximize throughput. This one is tricky — since most of us in the family use Chrome, we have to be aware of the built-in limitation of five concurrent calls to the same domain. In the node.js runtime it can be useful to set the “concurrency” value quite high, but in the browser environment that will just cause blocked requests and timeout chaos. This took me a while to figure out … a little mention in the docs might be nice, folks.

With all of this, uploading seems pretty reliable. Not perfect though — it still dies with frustrating randomness. Balancing all the config parameters is really important, and unfortunately the “best” values change depending on available upload bandwidth. I think I will add a helper so that folks can use the “azcopy” tool to upload as well — it can really crank up the parallelization and seems much less brittle with respect to network hiccups. Command-line tools just aren’t very family friendly, but for what it’s worth:

  1. Download azcopy and extract it onto your PATH.
  2. Log in by running azcopy login … this will tell you to open up a browser and log in with a one-time code.
  3. Run the copy with a command like azcopy cp FILENAME https://STORAGE_ACCT.blob.core.windows.net/CONTAINER/FILENAME --put-md5 --block-blob-tier=Archive.
  4. If you’re running Linux, handy to do #3 in a screen session so you can detach and not worry about logging out.

Thawing

Remember that files in the Archive tier can’t be directly downloaded — they need to be “rehydrated” (I prefer “thawed”) out of Archive first. There are two ways to do this: (1) just flip the “tier” bit to Hot or Cool to initiate the thaw, or (2) make a duplicate copy of the archived blob, leaving the original in place but putting the new one into an active tier. Both take the same amount of time to thaw (about fifteen hours), but it turns out that #2 is usually the better option for cold-storage use cases. The reason why comes down to cost management — if you move a file out of archive before it’s been there for 180 days, you are assessed a non-trivial financial penalty (equivalent to if you were using an active tier for storage the whole time). Managing this time window is a hassle and the copy avoids it.

So this should be easy, right? Just call beginCopyFromURL with the desired active tier value in the options object. I mean, that’s what the docs literally say to do, right?

Nope. For absolutely no reason that I can ascertain online, this doesn’t work in the JavaScript client library — it just returns a failure code. Classic 2020 Microsoft developer experience … things work in one client library but not another, the differences aren’t documented anywhere, and it just eats hour after hour trying to figure out what is going on via Github, Google and Stack Exchange. Thank goodness for folks like this that document their own struggles … hopefully this post will show up in somebody else’s search and help them out the same way.

Anyways, the only approach that seems to work is to just skip the client library and call the REST API directly. Which is no big deal except for the boatload of crypto required. Thanks to the link above, I got it working using the crypto-js npm module. I guess I’m glad to have that code around now at least, because I’m sure I’ll need it again in the future.

But wait, we’re still not done! Try as I might, the method that worked on my local development environment would not run when deployed to the server: “CryptoJS not found”. Apparently the emulator doesn’t really “emulate” very well. Look, I totally understand that this is a hard job and it’s impossible to do perfectly — but it is crystal clear that the SWA emulator was hacked together by a bunch of random developers with no PM oversight. Argh.

By digging super-deep into the deployment logs, it appeared that the Oryx build thingy that assembles SWAs didn’t think my API functions had dependent modules at all. This was confusing, since I was already dependent on the @azure/storage-blob package and it was working fine. I finally realized that the package.json file in the API folder wasn’t listing my dependencies. The same file in the root directory (where you must run npm install for local development) was fine. What the f*ck ever, man … duping the dependencies in both folders fixed it up.

Downloading and Deleting

The last of our tasks was to implement download and delete — thankfully, not a lot of surprises with these. The only notable bit is setting the correct Content-Type and Content-Disposition headers on download so that the files get saved as downloads, rather than opening up in the browser or whatever other application is registered. Hooray for small wins!

That’s All Folks

What a journey. All in all it’s a solid little app — and great functionality to ensure our family’s pictures and videos are safe. But I cannot overstate just how disappointed I am in the Microsoft developer experience. I am particularly sensitive to this for two reasons:

First, the fundamental Azure infrastructure is really really good! It performs well, the cost is reasonable, and there is a ton of rich functionality — like Static Web Apps — that really reduce the cost of entry for building stuff. It should be a no-brainer for anyone looking to create secure, modern, performant apps — not a spider-web of sh*tty half-assed Hello World tutorials that stop working the day after they’re published.

Even worse for my personal blood pressure, devex used to be the crown jewel of the company. When I was there in the early 90s and even the mid 00s, I was really, really proud of how great it was to build for Windows. Books like Advanced Windows and Inside OLE were correct and complete. API consistency and controlled deprecation were incredibly important — there was tons of code written just to make sure old apps kept working. Yes it was a little insane — but I can tell you it was 100% sincere.

Building for this stuff today feels like it’s about one third coding, one third installing tools and dependencies, and one third searching Google to figure out why nothing works. And it’s not just Microsoft by any means — it just hurts me the most to see how far they’ve fallen. I’m glad to have fought the good fight on this one, but I think I need a break … whatever I write next will be back in my little Linux/Java bubble, thank you very much.