Every SaaS integration needs an upstream point of contact

Can you see a pattern in these examples?

  • SendGrid sends emails for you. You provide it with a webhook URL. It uses the webhook to “call you back” about what happens with those emails, like when they are dropped because your request was invalid.
  • On AWS, you need to listen to an event bus if you want to know when your pods are about to be shut down (among other things).
  • You define a special DNS record called DMARC, to be like: “hey, if email that says it’s from me isn’t passing tests, help me out and send a report to this address.”
  • A web browser suppresses some JavaScript on your page because of the CSP policy set in your headers. It sends a notification to a URL you specify, to let you know about the suppression.
  • Back in the day, serial mice reserved a few wires for data going the other way, from the computer to the mouse. This sideband communication included: “that’s too fast!”
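The DMARC example is literally one DNS TXT record. A minimal record looks roughly like this (the domain and report address are placeholders):

```
_dmarc.example.com.  IN  TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"
```

The `rua` tag is the “send a report to this address” part: receivers that see failing mail claiming to be from your domain mail aggregate reports there.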

Backpressure is an example of this pattern: the receiver of data gives the sender some signal that it needs to slow down or lose data.

Dead Letter Queue is another: messaging systems need someplace to put messages that just didn’t make it.

Error responses are not part of this pattern. If you can get a synchronous error message, cool, that’s definitely a way to tell a sender that something’s wrong. But some errors can’t be detected immediately.

Guaranteed message processing is more extreme than this pattern. There is no “guarantee” in “send and forget,” so clients need a way to know whether a message was processed successfully. But guarantees are expensive, and there’s a wider pattern that is important to systems that don’t need a guarantee.

There’s: “OMG, this one message was randomly dropped somewhere in the network,” and then there’s: “We tried, and we can’t do what you’re asking.”

Every asynchronous process needs a way to say, asynchronously, “Something went wrong. Do this differently in the future.”

The pattern I see:

There’s a client who makes requests of a service provider. The receiver of these instructions acknowledges successful receipt quickly, and the client rests gratefully. Sometimes though, the receiver finds out later that the instructions can’t be followed. The client should listen when the receiver sends it an asynchronous FYI.
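The pattern fits in a few lines of code. This is a hedged illustration: the function names and the event shape are made up, loosely modeled on SendGrid-style webhook events.

```python
# Sketch of the pattern: send, get an ack, and separately handle an
# asynchronous FYI when the provider later hits a problem.
# All names here (send_email, on_provider_event) are hypothetical.

def send_email(request):
    """The provider acks receipt quickly; processing happens later.
    In reality this would be an HTTP POST; here it just returns an id."""
    return "msg-123"  # the ack means "got it," not "done it"

def on_provider_event(event, failures):
    """The async FYI: the provider calls this webhook handler back,
    possibly much later, about instructions it couldn't follow."""
    if event.get("event") == "dropped":
        # Correlate with what we originally sent, then react:
        # alert a human, fix the template, or re-enqueue.
        failures.append(event["message_id"])

failures = []
msg_id = send_email({"to": "someone@example.com", "body": "hi"})
# Later, the provider reports that our instructions couldn't be followed:
on_provider_event({"event": "dropped", "message_id": msg_id,
                   "reason": "invalid"}, failures)
```

The point is the two separate moments: the ack at send time, and the FYI arriving through a different channel later.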

(Do you know an existing name for this pattern? Tell me on twitter.)

For example: at Honeycomb, we had an incident where trigger emails (“your latency passed this threshold,” for instance) weren’t being sent. This lasted a few days before a customer reported it! We deployed some code that made most of the email definitions invalid. SendGrid (the email-sending SaaS) accepted our send request, but later dropped it. We weren’t listening to their event webhook, or we’d have noticed the jump in “dropped—invalid” events.

Another example: Honeycomb is a SaaS that accepts telemetry data. Sometimes the events we get aren’t storable. For example, this can happen when someone accidentally uses a timestamp as a column name, and hits the limit of too many (thousands of) unique columns. Honeycomb-supplied client libraries can warn client code about this immediately. However, OpenTelemetry is now the standard. In the OTLP protocol, all we can send back is “200 Success” or else reject a whole batch of events. There’s no backchannel for “Hey, this one request didn’t pass muster.”
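The asymmetry looks like this in a hedged sketch. The names (`ingest_batch`, `is_storable`) are illustrative, not the real OTLP API; the point is that the sender only ever learns a status for the whole batch.

```python
# Illustration of batch-granularity feedback: individual events can be
# silently unstorable, but the response covers the whole batch.

def is_storable(event):
    # Stand-in check for the column-explosion case above: reject
    # events whose field name looks like an accidental timestamp.
    return not event["name"].startswith("2023-")

def ingest_batch(events):
    if not isinstance(events, list):
        return 400  # malformed request: the whole batch is rejected
    stored = [e for e in events if is_storable(e)]
    # Some events may have been dropped, but the sender still sees
    # a blanket success for the batch.
    return 200

status = ingest_batch([{"name": "span-a"},
                       {"name": "2023-01-01T00:00:00Z"}])  # status is 200
```

From the sender’s side, `status == 200` even though one event never made it into storage.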

A positive example: when a customer sends so many events that they’re going to exceed their rate limit, they get an email. They’ve configured a point of contact to hear about this problem.

Every downstream service needs an upstream point of contact.

This point of contact might be human. Ideally it also supports software.

At Cloudways, they call this a “bot.” You can get notifications in Slack as a human, or you can listen for a webhook in software.

At SendGrid, it’s a webhook. If you want a person to watch events, they provide an open-source GUI app you can run on Heroku’s free tier.

At Honeycomb, we have a way to tell customers (humans) that the rate limit is in danger. We don’t have a way for them to hear about events dropped because they exceeded a (very high) column limit. Well—we do—it’s customer support. A human talks to a human. I’d like it if our software could talk to their software, too!

When there’s a systematic error, the service detecting the error needs to report that back to the service that can do something about it.

Implementing this backchannel takes some thinking.

SendGrid’s event webhook illustrates considerations in automating the upstream point of contact.

  • Passthrough fields so that the client can correlate the error with what they originally sent
  • Security: how does the receiver know it’s really from SendGrid?
  • Backpressure again! for when there are a lot of events to report
  • Configuration: both a console and a REST endpoint for configuring it by hand or with software
  • Testing: how to know whether your configuration is working
  • Integrations into other SaaS that can read these events.
  • Code examples for software implementations
  • Human-usable versions for customers working at small scale.
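The security bullet, for instance, usually comes down to verifying a signature over the raw request body. SendGrid’s actual scheme differs in its details; this is a generic HMAC sketch (with made-up header and secret names) that shows the shape of the idea:

```python
import hashlib
import hmac

def verify_webhook(payload, signature_header, shared_secret):
    """Recompute the signature over the raw body and compare in
    constant time. Reject anything that doesn't match."""
    expected = hmac.new(shared_secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

# The receiver and the provider share a signing secret; the provider
# sends the signature alongside each delivery (e.g., in a header).
secret = b"webhook-signing-secret"     # made-up secret
body = b'[{"event":"dropped","reason":"invalid"}]'
sig = hmac.new(secret, body, hashlib.sha256).hexdigest()

ok = verify_webhook(body, sig, secret)          # True: genuine delivery
tampered = verify_webhook(b"oops", sig, secret)  # False: body was altered
```

Without some check like this, anyone who discovers your webhook URL can feed your system fake “dropped” events.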

SendGrid has an extensive implementation because it sends events like “Email delivered” and “Email opened.” So their webhook is on the happy path for many customers.

“Dropped—invalid” is a corner case for them. As a developer of an upstream application, it’s the error message that is precious to me.

An upstream point of contact (or backchannel) is a part of the developer interface of any SaaS.

I really want to be able to send instructions for an email and forget about it. That optimism is justified if I can be confident that I’ll hear about what doesn’t work.

In the old days of request/response, I could expect a synchronous error. But synchronous response is a deep coupling that limits our system design. Instead, I need a request/ack… and then later, an FYI.

If we’re gonna send instructions to SendGrid, we need to listen on its webhook. We need to pass our trace header as a custom field, and then link to our “email send” trace in the telemetry we create when we receive “Dropped—invalid.” That way we can see the story of what we did to cause the problem.
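A hedged sketch of that correlation, assuming the provider echoes passthrough fields back on each event (SendGrid supports custom passthrough fields, but the exact field names below are illustrative):

```python
# Stash our trace context as a custom field on the outgoing send, then
# read it back off the webhook event so a "dropped" event links to the
# original "email send" trace. Field names here are made up.

def build_send_request(to, trace_id):
    return {
        "to": to,
        "custom_args": {"trace_id": trace_id},  # passthrough field
    }

def handle_event(event):
    # The provider echoes our passthrough fields back on each event.
    if event.get("event") == "dropped":
        return event.get("trace_id")  # link telemetry to the send trace
    return None

req = build_send_request("a@example.com", trace_id="abc123")
# A provider-side dropped event would echo our custom args back:
echoed = {"event": "dropped", "reason": "invalid", **req["custom_args"]}
linked_trace = handle_event(echoed)
```

Now the telemetry we emit on “Dropped—invalid” carries `abc123`, and the failure shows up in the same story as the send.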

Hopefully OpenTelemetry standards will add a standard way to inform senders of unstorable events. Before then, maybe we’ll come up with our own.

Until that day, there’s the ultimate source of resilience in software systems: human relationships.