REST and long-running jobs (2014) (farazdagi.com)
98 points by geezerjay on Dec 1, 2018 | 59 comments


One thing this piece leaves out is that if the operation/creation is in fact expensive, you want to make sure you have some kind of two-phase option so you don't inadvertently create multiple copies.

If the client crashes after the POST to create 'death star' succeeds (and hence construction starts), but before it can read the response of the POST in the example, it never gets the pointer to /queue/12345 from the response. So when the client reissues the POST, the server should reject the second attempt to create a star named 'death star' and give back the same pointer to /queue/12345. Or, if you want to be able to have two stars named 'death star', the client should first POST to some endpoint to get a unique identifier from the server, and then use that identifier to kick off the creation of the death star so the client can reason about the current state.
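A minimal sketch of that reject-and-repoint behavior, assuming a Flask server with a toy in-memory store (a real implementation would persist the name-to-job mapping atomically):

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    jobs_by_name = {}   # star name -> queued-task id (stand-in for real persistence)
    next_id = 12345

    @app.route("/stars", methods=["POST"])
    def create_star():
        global next_id
        name = request.get_json()["name"]
        if name in jobs_by_name:
            # Duplicate POST (e.g. client crashed before reading the first
            # response): hand back the same queued-task pointer.
            resp = jsonify(status="already queued")
            resp.status_code = 409
            resp.headers["Location"] = f"/queue/{jobs_by_name[name]}"
            return resp
        jobs_by_name[name] = next_id
        resp = jsonify(status="queued")
        resp.status_code = 202
        resp.headers["Location"] = f"/queue/{next_id}"
        next_id += 1
        return resp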


This. Too bad many APIs out there are designed without keeping in mind that timeouts and double submissions can occur.

A few years ago I had to integrate with another software vendor (using SOAP, sigh), and development went smoothly, testing too, but when the integration went live we noticed that the VPN connection was somehow not working properly and we were getting 20% packet loss. Luckily, for each API call that would write data we issued a unique "requestId" and implemented retry logic, so double submissions (e.g. due to timeouts) were discarded.

Needless to say, that saved our day and the project went smoothly even though it took the DevOps guy a week to debug and fix the faulty VPN tunnel.
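A sketch of that client-side logic (endpoint and payload shape hypothetical); the key point is that retries reuse the same requestId so the server can discard duplicates:

    import uuid
    import requests

    def submit_with_retry(url, payload, attempts=3):
        # One requestId per logical operation: a retry after a timeout reuses
        # it, so the server can recognize and discard the double submission.
        request_id = str(uuid.uuid4())
        for _ in range(attempts):
            try:
                return requests.post(url, json={**payload, "requestId": request_id},
                                     timeout=5)
            except requests.exceptions.Timeout:
                continue  # same requestId on the next attempt
        raise RuntimeError("all retries timed out")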


> when the client reissues the POST, the server should reject the second attempt to create a star named 'death star' and give back the same pointer to /queue/12345.

This use case is typically handled by replying to the duplicate POST request with status code 409 Conflict, and including in the response a link to the resource representing the long-running job.

https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec1...


You would use PUT in the instance where you have the same id (in other words, the same resource representation/URL), and PUTs should be idempotent, meaning that you can re-attempt creation multiple times without side effects.

POSTs should not be idempotent. It shouldn't be expensive to generate a new ID, so you can just return the new ID/URL and then the client can check to see if/when that ID is created.

An important factor in REST is that it moves state processing to the client. The clients need to track transactions, not the server. The servers don't have a concept of a session like old-school web apps did. Any state (like an ID that was just created) should be tracked on the client.

Either a 202 Accepted for that ID resource that's not yet created, or a 4xx-level error (4xx meaning 'retry, this is not a permanent failure') such as 423 Locked could be appropriate while you are actually generating that resource.


Or you could just generate a unique identifier on the client and save the roundtrip? UUIDs aren't really expensive.


A UUID is not enough to prevent duplicate resource creation. Consider the following steps: 1. receive a request with a UUID; 2. persist the fact that the system received this request; 3. decide who or what processes it. Even if you are working off atomic persistence, the resource-creation process (the kick-off process) needs to maintain the state that it plucked one off the queue or table, atomically update that state in persistence as it creates the resource somewhere else, and log the successful creation of the resource as yet another entry in the persistence store. This way, requests that exceed their SLA can be retried, and the downstream systems must act with correct idempotent behavior for all of this to work correctly in the presence of failures.

It is all very technically possible, but it goes beyond just a UUID (in other words, UUID is necessary but not sufficient).
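To make that concrete, a minimal sketch of the bookkeeping using sqlite3 (schema and state names are assumptions, not from the article):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE requests (
        uuid  TEXT PRIMARY KEY,              -- the client's dedup key
        state TEXT NOT NULL DEFAULT 'received')""")

    def record_request(req_uuid):
        # The PRIMARY KEY makes this atomic: a duplicate UUID fails the
        # insert instead of kicking off a second creation.
        try:
            with conn:
                conn.execute("INSERT INTO requests (uuid) VALUES (?)", (req_uuid,))
            return True                      # first time we've seen this request
        except sqlite3.IntegrityError:
            return False                     # duplicate; point at the existing job

    def claim_for_processing(req_uuid):
        # Atomic 'received' -> 'processing' transition, so only one worker
        # plucks the request off the table.
        with conn:
            cur = conn.execute(
                "UPDATE requests SET state = 'processing' "
                "WHERE uuid = ? AND state = 'received'", (req_uuid,))
        return cur.rowcount == 1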


The issues you list don't really seem to be relevant to duplicate resource creation? They're just about handling the creation of the resource in the first place.

As long as the UUID is persisted, you can redirect all attempts to recreate that ID to the right job/queue/whatever actually backs the resource.


> Your first instinct might be: “What if I return HTTP 201 Created immediately, but defer the actual creation to some later point?”... Well, you can’t do that. If you do, you will be violating the HTTP/1.1: Semantics and Content protocol

> We know what status code to return, what about Location header? Simple. Instead of the location to the actual resource (/stars/97865), API will return location of the queued task that got created (/queue/12345)

> How to learn when resource is finally available? You need to query the queued task... Once resource is created, API should respond with 303 See Other status code on all the subsequent requests to the queued task... There are two alternative ways to deal with the temporary task resource: API client must issue DELETE request, so that server purges it. or, garbage collection can be a server’s job to do

I've always wondered about how important it is to follow REST conventions, at the cost of simplicity, when you control both the server and client software.

In the example given above, the recommendation to use a 202 instead of a 201 sounds perfectly reasonable. However, creating and returning to the client a new queue resource, which returns a 303 when complete, and which the client then has to delete... Is it really so bad to just return /stars/98765 immediately? And when queried, have the relevant GET API return a response indicating that construction of the resource is still in progress? If the client needs to wait for construction to complete before doing something else, can't it just poll /stars/98765 instead of /queue/12345?

More generally speaking, if you're designing an API for use by your own client, and you're facing a tradeoff between interface/implementation simplicity vs following the official REST protocol, is it really worth choosing the latter?
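For concreteness, here's the article's full flow from the client side as a sketch (host and exact paths hypothetical):

    import time
    import requests

    def create_and_wait(base, payload):
        r = requests.post(f"{base}/stars", json=payload)
        assert r.status_code == 202                    # Accepted, not yet created
        task_url = base + r.headers["Location"]        # e.g. /queue/12345
        while True:
            poll = requests.get(task_url, allow_redirects=False)
            if poll.status_code == 303:                # done: follow See Other
                star = requests.get(base + poll.headers["Location"])
                requests.delete(task_url)              # purge the temporary task
                return star.json()
            time.sleep(2)                              # still pending; poll again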


HTTP is an inherently complex protocol, which has over time accrued many idiosyncratic, non-orthogonal features to support various use cases of the growing Web. Just consider that there exists an entire class, entire design space of libraries known as “HTTP routers” which boil down to extracting arguments from the first line of an HTTP request.

If you want simplicity, and you fully control both sides, and you don’t care about the systemic advantages that REST purports to provide, then your best bet is to avoid HTTP altogether (which in practical terms may of course mean tunneling through it) and stick to a simple, modern RPC protocol.


> Just consider that there exists an entire class, entire design space of libraries known as “HTTP routers” which boil down to extracting arguments from the first line of an HTTP request.

This comment sounds a bit disingenuous. Routing involves way more than extracting arguments from the first line of an HTTP request, but its complexity is not due to HTTP. Routing is based on content negotiation, and essentially everyone is free to design their own personal content negotiation process, and very often they do.

Take, for example, content type. Do file extensions matter? Does the Content-Type header mean anything? If both are used what should prevail? What should the router do if none was passed?

On top of that, then let's add HATEOAS, HAL, content type API versioning, custom headers, etc...

In the end developers need to map HTTP requests to an action, and HTTP is not the problem.

Libraries are helpful not because the problem is complex, but because ready-made solutions are better than rolling our own. There are plenty of libraries not because HTTP is complex, but because plenty of people have their personal preference.


> extracting arguments from the first line of an HTTP request.

The reason this sounds simple is that the use of “arguments” conceals a good deal of semantics. Yes, you can cut a huge amount of complexity out by using a subset of HTTP, and I do recommend that, but by switching to a “modern” RPC protocol you also lose a good deal of introspection, low-effort ramp-up for new contributors, and highly replicated semantics on the server and client side. You can debug or interact with curl (or a number of similar tools), or a browser, or virtually any programming language with many choices of implementation. There is a ton of commodity tooling for log parsing, playback inspection, testing, fuzzing, load balancing, etc. You get a lot for “free”, and the potential complexity by itself is a poor excuse for throwing the baby out with the bath water.

I also realize you may have acknowledged this by referencing “systemic” benefits, but I’d like to spell out just how large they are.


You may control the client, but do you control the client library? If you want to get the most out of something like, say, Python's requests, following these conventions is your best bet.


Case in point: CacheControl [1]. Plug it in and you get flexible caching — interoperable with NGINX, the browser, and everybody else — for very little. But it relies on the protocol. If you respond to GET /stars/98765 with 200 (OK) and an “under construction” placeholder, it’s going to get confused.

[1] https://github.com/ionrock/cachecontrol
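Wiring it in is a couple of lines (a sketch; URL hypothetical):

    import requests
    from cachecontrol import CacheControl

    # Wrap a plain requests session; caching behavior is then driven entirely
    # by the standard Cache-Control headers the server sends.
    sess = CacheControl(requests.Session())
    resp = sess.get("https://api.example.com/stars/98765")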


This author misses a key algorithm design option: the API does NOT offer a CREATE on such a resource, the API offers a CREATE on a JOB which generates the long-time-to-create resource. So the API client simply creates the Job, receives an ID immediately, which can be queried for status, forever. Long after the job has completed, it serves as a timestamped record of that specific resource's creation. Only after the job status reaches predefined levels are the assets (files/data) associated with it made available.

I've used this pattern with success for over a decade. First with on-demand digital products, which required compiling/generation time after a customer's request. The pattern also supports partial generation of long-time-to-generate resources. I used it with a custom game avatar service, where various modification and customization options are visualized without actually being created. The visualization of the modified product is a 'milestone' asset generated by the resource-creation pipeline. If the end user chooses not to purchase said modification, that specific asset's sub-job is killed or never started.

With complex, multi-layered digital products this pattern works quite well. Even after a period of time, the end-user could return to a multi-leveled job and request an asset they previously denied. That sub-job gets timestamped just the same as if it ran originally. It just works.
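A hypothetical shape for such a job record (field names are illustrative, not from the service described):

    # The job is a permanent, timestamped record; assets only become
    # available once their sub-jobs reach the right status.
    job = {
        "id": "job-8812",
        "created_at": "2018-12-01T10:15:00Z",
        "status": "complete",                # queued -> running -> complete
        "assets": {
            "preview.png": {"status": "complete", "url": "/assets/8812/preview.png"},
            "full_model":  {"status": "not_started"},   # sub-job never purchased
        },
    }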


>This author misses a key algorithm design option: the API does NOT offer a CREATE on such a resource, the API offers a CREATE on a JOB which generates the long-time-to-create resource. So the API client simply creates the Job, receives an ID immediately, which can be queried for status, forever.

You should read the article because that's the approach it describes.


This. I believe this is the most straightforward and robust way to handle this scenario.


A coworker and I were discussing this very thing on Friday. He and I exhausted our Google-fu without success: Does anyone know of a framework/library that implements this? Don’t really care about the language, but per other comments we’re interested in the practical lessons learned that would show up in code.


RFC 7240 also defines an optional way for the client to signal whether it wishes the request to be processed asynchronously: https://tools.ietf.org/html/rfc7240#section-4.1


My first impression is that RFC 7240 isn't appropriate for this use case, because it is proposed as a way for the client to signal optional/preferred behavior when said behavior is already made mandatory by the server.

To put it differently, why would it matter if a client POSTs with Prefer: respond-async if the server response will always be async? Whether that header is present or not, the HTTP response is already designed to always return a Location header.
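For illustration, here is all the header amounts to from the client's side (a sketch with Python's requests; URL and payload hypothetical):

    import requests

    # Per RFC 7240 section 4.1 the client merely *prefers* async processing.
    # If the server always responds asynchronously anyway, the header changes
    # nothing: the 202 + Location come back either way.
    resp = requests.post("https://api.example.com/stars",
                         json={"name": "death star"},
                         headers={"Prefer": "respond-async"})
    print(resp.status_code, resp.headers.get("Location"))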


Someone once gave a talk on 'the Art of API design' that covered logical design of a RESTful API and one of their points was in close alignment with the article.

Can't find any slides to share but will edit/reply to self if I can; the talk was very informative and covered not just resource creation but destruction, discovery, and even a bit about authentication.


Anybody here who has used this pattern in practice and found it is not as good as this article portrays?


Bottom line is if you’re doing this via REST and not using web sockets (which is fine, of course!) then you’re going to end up in a polling situation. That’s the important thing to think about. I had a bulk export job that ran similar to this, minus a lot of the frills. On the initial POST you get back a token that would allow you to poll the job status. Until you went to download the (often several GB) payload, all requests were extremely fast. At most we would insert a single record into a DB and queue a tiny object onto the message bus. In the status poll it was a simple key lookup.

In my experience nobody seriously tries to figure out how to use your API by calling it; they'll prefer to look at your code examples and documentation.

Nobody is going to build a generic-enough REST client to make all the HATEOAS crap matter for their workflow. In my view it’s a waste of time, but maybe others have found it measurably useful?


> Bottom line is if you’re doing this via REST and not using web sockets

Websockets are not a viable solution to this problem: long-running processes are, by definition, long-running, and it would make no sense to keep a connection open for minutes or hours just to poll the process status.


Sure, good point. I’d argue this isn’t exactly black and white but you won’t find me defending web sockets for many applications at all anyway.


I'm using it, it works fine. Two things to keep in mind though:

First, the post recommends that "once resource is created, API should respond with 303 See Other status code on all the subsequent requests to the queued task." That's a great idea, but NOT if you're, say, polling from a web browser via AJAX hoping to get some snippet of data to display to the end user. Browsers take 3xx codes seriously, so they will transparently follow the redirect. This is probably not what you had in mind, so you'll end up needing to use a 2xx code.
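Outside the browser this is less of a problem: with Python's requests, for example, you can opt out of the transparent redirect and handle the 303 yourself. A sketch (URL hypothetical):

    import requests

    resp = requests.get("https://api.example.com/queue/12345",
                        allow_redirects=False)
    if resp.status_code == 303:
        star_url = resp.headers["Location"]   # resource is ready
    else:
        print(resp.json())                    # still pending; show status to user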

And that brings us to the second point: This post is basically "here's how to do something pretty basic in a RESTful and semantic way". That's fine! But as above, sometimes that's not feasible, and even when it is feasible, you could also just...you know...not be RESTful. REST isn't magic, and it's not a synonym for "an HTTP based API". And sometimes all you need is an HTTP API.

Yes, using a 202 Accepted response with a Location header for where to poll is clever and (I think) elegant. But it's functionally identical to a 200 OK response with the URL in the body. Similarly, a link to the cancellation method when you poll is cool; you'll feel smug showing that off during code review. But... no client is magically going to work with this (or any other) scheme; you're going to have to hand-code all the logic, come up with a schema for conveying job status, etc.

If you enjoy the challenge of making the logic as RESTful as possible while you're doing that, then sure, this describes a good way to be pretty RESTful. But don't feel like you're missing out on much if you decide not to do all that. There's no actual benefit from using HTTP response codes in a super semantic way. :)

In short: it's as good as the article portrays it, but keep in mind the problem being solved is "handling a long-running job RESTfully". Handling a long-running job via an HTTP API is a different (simpler) problem.


I've been doing it in production at a rate of a few million individual asynchronous transactions per day for five years now.

It adds a lot of complexity to the client that this article doesn't touch on. The client has to maintain state, or you have to provide a means for it to reconstruct its state. Clients can misbehave in ways that weren't as likely with simple synchronous creates.

Wherever we can, we favor things like websockets and webhooks to push state back to the client. Unfortunately, that isn't always possible, and doing something like this article describes is the best option.


This layout is similar to something I've done in production. In my use case we don't support stopping of jobs and I return job completion status in aggregate.

I also avoid confusion in status codes and implementation by separating the request for a resource and the resource itself into different endpoints.

    POST /<resource>-requests
    {
        ...
    }

    GET /<resource>-requests
    [
        {
            "id": <id>,
            "status": "complete",
            ...
            "url": "/<resource>/<id>"
        }
    ]
It all comes down to your queue-processing system and making sure it's sensibly configured. For long-running jobs (1+ hr) the stack we use at work (Laravel/Lumen) comes very poorly configured.


I have. I used it when creating a long-running report. While polling, the endpoint would also return a status update: the percentage complete and where it was in the process.


It sounds pretty sensible to me, I filed it away for next time I need to do something like this.


These patterns are geospecific. They are only known to work within the vicinity of the town of Theory.

You may find several deviations and compromises that occur outside this region, though the intention is often to emulate a loose collection of (sometimes) consensus based best practices.


They work just fine for millions of services; they're called "websites". When you click on a link/button that starts a long running task, you often get redirected to a status page which refreshes (polling), and which has a link to cancel.


Instead you could also POST to /star-construction to get back a link to a /star-construction/foo resource, representing a long-running creation process.

This resource could be polled to get progress updates, and eventually a link to the created star resource, or perhaps an error describing why it didn't work.

The advantage is that you don't have to deal with two different types of resource being returned at the same URL, which complicates parsing the response. If creations are queued, you also get a view of the workload still in front of you.
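To sketch what that buys you, hypothetical representations of the construction resource (payload shapes are assumptions):

    # The /star-construction/foo resource always has one shape; only the
    # status varies, so clients never have to guess which kind of document
    # they got back.
    in_progress = {"status": "running", "progress": 0.42, "queued_ahead": 3}
    succeeded   = {"status": "complete", "star": "/stars/98765"}
    failed      = {"status": "error", "reason": "insufficient raw materials"}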


Better options would be:

1. Send partial HTTP responses (206) with progress details, which may be handled using the AJAX progress API on the client before sending the final response. This is processing- and network-intensive but may be needed in certain scenarios like drawing graphics, etc. You cannot recover from connection loss.

2. Hand the task over to a job scheduler and return the job id. The client can then poll the job id using a status API or get the status pushed via a webhook. This is the more optimal approach for the majority of cases.


> 2. Hand the task over to a job scheduler and return the job id.

That's the approach described in the article. Where does your suggestion differ from it?


206 is only valid to respond to range requests, which can only be GET requests.


Additionally, the 206 suggestion fails to take into account basic use cases such as having more than one queued job.


Number 2 is super easy to implement with libraries like celery.
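A minimal sketch of that wiring with Celery (broker URL and task body are placeholders):

    from celery import Celery

    app = Celery("tasks",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/0")

    @app.task
    def build_star(name):
        ...  # the long-running work happens in a worker process

    # In the HTTP handler: enqueue and return the job id to the client.
    result = build_star.delay("death star")
    job_id = result.id

    # In the status endpoint: look the job up and report its state.
    state = app.AsyncResult(job_id).state   # e.g. PENDING, SUCCESS, FAILURE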


#2 is essentially what the article describes.


I found this discussion a bit incomplete as it didn't really discuss the consequences or costs or benefits of modelling things one way or another.

Suppose we didn't use REST or HTTP for our API. What would be a good way to model this kind of thing?


This is essentially just an asynchronous response to some request. I expect there are probably three cases:

1. Response is unimportant: do nothing

2. Response timing is not important: client polls

3. Response timing is important: server calls back into the client somehow

The article describes an implementation of point 2. Depending on the technologies available at the time, I'd argue 3 is probably more versatile: cases 1 and 2, as well as the blocking case, can all be implemented on top of case 3, if the available technology allows it.


I suppose one way of modeling a solution to this space is to look back at good ol' fashioned RPCs.


This article is missing one detail: what if resource creation fails? How do you signal that failure when the client calls GET on the task queue resource, in a way that's distinguishable from a failure to retrieve the task queue status?


That's like asking whether you should send a 404 when you serve an article about a city that no longer exists.

HTTP status codes refer to the resource being requested (in this case, the creation task itself), not to something that resource refers to.


You can't map everything cleanly to HTTP status codes. When you implement the task queue resource, you'll need to come up with ways of representing what information is important to you (status, success, failure, eta, progress, etc.), and then when you implement the client you'll have to parse that.


I think it depends a lot on why the operation failed and what you need the client to do next. If it is something the client can fix, I think it is reasonable to respond with the appropriate 4xx-series status and enough detail that they can correct and proceed. Same with the 5xx series and issues that are out of the client's control. There are other approaches that are more or less as valid. The important thing is that we return enough information for the client to proceed. Whether that is signalled through status codes, headers, or the response body isn't as important, in my opinion.


> This article is missing one detail: what if resource creation fails?

I didn't get your point.

If the resource creation fails during the request then either a 4xx or 5xx status is returned.

If the resource creation succeeds but the long-running process fails in some way, either during processing or at creation time, then the resource representing the long-running process will be updated by the server to inform the client of the process' current status.


“Once the originally desired resource is created, there are two alternative ways to deal with the temporary task resource:

API client must issue DELETE request, so that server purges it. Until then, server responds with 303 See Other status. Once deleted, 404 Not Found will be returned for subsequent GET /queue/12345 requests. or, garbage collection can be a server’s job to do: once task is complete server can safely remove it and respond with the 410 Gone on subsequent GET /queue/12345 requests.”

How do you distinguish between an id referring to a finished task that no longer exists (410 Gone), and an id that never referred to any task (404 Not Found)? Other than the obvious solution of keeping every task in the DB with a “finished” flag, which doesn’t scale...


https://tools.ietf.org/html/rfc7231#section-6.5.9

> It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.


I'd say "gone" is a temporary state that exists shortly after task completion. A task shows as "gone" for, say, 15min, and then is deleted, producing a 404 in subsequent requests.

On further thought, it seems to me that we necessarily need to keep completed tasks around until asked by a client to delete them. If a task produces an exception, we need to notate that for when the client next asks about the task's status. If the task ran successfully to completion, we need to notate that for when the client next asks. If the client needs to know all this, then it's the client who needs to ask for deletions as well - we shouldn't assume it's OK to remove records about tasks (unless there's a published policy that completed/error data will be purged after a certain period, e.g. 30, 60, 90 days...)


I’d say after a period of tolerance (410) you switch to 404. If you care about historicals, move the record to a dedicated DW solution and use that to report.


> Other than the obvious solution of keeping every task in the DB with a “finished” flag, which doesn’t scale...

Why doesn't that scale?


Because your storage requirements would grow indefinitely as you would need to store a record for every task ever performed.


If your IDs are completely unique, you could add them to a Bloom filter [0] when cleaning up the resource. Then you could use that to determine if an incoming ID has possibly been used before or definitely not.

[0] https://en.wikipedia.org/wiki/Bloom_filter


Bloom filters are probabilistic, meaning they don't always produce the right result. From the article you posted:

> False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.

So I don't think they're the right choice for this problem, if you actually depend on the result "not here" vs "never was here" being right.


Or you could just have your IDs be monotonically increasing and look at the minimum. Any task prior to that number is "410 Gone" by definition.
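A sketch of that check (names are mine):

    # Ids are issued monotonically; only recent tasks are retained. Anything
    # below the retention floor existed once and is now Gone; anything at or
    # above the issue counter was never created.
    def status_for(task_id, retained, next_id):
        if task_id in retained:     # still tracked: serve its real status
            return 200
        if task_id >= next_id:      # higher than any id ever issued
            return 404
        return 410                  # issued at some point, purged since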


> Because your storage requirements would grow indefinitely as you would need to store a record for every task ever performed.

The only info that needs to be stored is the ID of the finished task. If, say, the ID is expressed as a GUID then we're talking about 4 ints or 2 longs worth of data.

Even so, these records are supposed to be garbage collected.


You could keep it very tight. IME it’s the index overhead that kills (1000 times as much data makes inserts 10x slower), and ever since partial indexes became a thing that’s a bit less of a problem.

Also deleting is far from free of side effects.


Sure, but what's wrong with that?


So? Storage is cheap. Even supposing you were going to store every single one as a 16-byte UUID, you can store 2^32 such ids in 2^32 * 2^4 = 2^36 bytes ≈ 64 GB. Is that a large database in your mind?



