The evolution of Captcha has taken us from identifying weird letters to training AI by translating images to text. Now, finally, users won’t have to do anything
The experience of squinting at distorted text, puzzling over small images, or even simply clicking on a checkbox to prove you aren’t a robot could soon be over, if a new Google service takes off.
The company has revealed the latest evolution of the Captcha (short, sort of, for Completely Automated Public Turing test to tell Computers and Humans Apart), which aims to do away with any interruption at all: the new, “invisible reCaptcha” aims to tell whether a given visitor is a robot or not purely by analysing their browsing behaviour. Barring a short wait while the system does its job, a typical human visitor shouldn’t have to do anything else to prove they’re not a robot.
It’s a long way from the first Captchas, introduced to stop automated programs signing up for services like email addresses and social media accounts. The idea is simple: pick a task that a human can do easily but a machine finds very hard, and require that the task be completed before the process can continue.
The first Captchas often relied on obfuscated text: a few letters and numbers, blurred, distorted, or otherwise rendered hard to parse with conventional character recognition software. Even then, they were still bypassed fairly frequently. The limited number of characters available in the Latin alphabet meant that the software could quickly improve to a passable level of accuracy, while obfuscating the letters any further could lead to real humans – particularly those with poor eyesight – being locked out.
But the first big breakthrough in Captchas to hit the web had nothing to do with making it harder for robots to pass them. Instead, it was an insight that all the effort people were putting into staring at squiggly text could be far better applied.
Dubbed reCaptcha, the idea came in 2008 from Luis von Ahn, a professor at Carnegie Mellon University who has since co-founded the language learning startup Duolingo. Von Ahn realised that if humans were doing something that computers found hard – reading distorted text – they should at least be reading text that is useful.
ReCaptcha replaced the autogenerated text in previous Captchas with words drawn from scanned text such as newspapers, books and magazines: text that needed to be turned into computer-readable type. It still distorted the images, in order to keep computers out, but the real words typed in were fed back to the database to improve the original information.
That introduced a second problem, though: if a computer can’t read the word presented, how does the system know whether the user got it right or wrong? Von Ahn’s solution was to present pairs of words, one already solved and one unknown word. If the solution for the first one matches that given previously, then the user is probably a human – and so the second answer also gets added to the database, and subsequently presented to a new user.
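The paired-word check can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not reCaptcha’s actual code: the class name, the method names and the image ids are all hypothetical, and the rule that an unknown word is only accepted once two trusted users agree on it is an assumption added for the sketch.

```python
from collections import Counter, defaultdict


class PairedWordCaptcha:
    """Toy model of the two-word check (all names are hypothetical)."""

    def __init__(self, solved, votes_needed=2):
        # solved: image id -> text already known to be correct (control words)
        self.solved = dict(solved)
        # votes: image id -> tally of answers from users who passed the control
        self.votes = defaultdict(Counter)
        # votes_needed is an assumption of this sketch, not a documented detail
        self.votes_needed = votes_needed

    @staticmethod
    def _normalise(text):
        # Ignore case and surrounding whitespace when comparing answers
        return text.strip().lower()

    def submit(self, control_id, control_answer, unknown_id, unknown_answer):
        """Return True if the user passes; record their guess for the unknown word."""
        expected = self._normalise(self.solved[control_id])
        if self._normalise(control_answer) != expected:
            # Failed the known word: treat as a robot and discard the guess
            return False

        guess = self._normalise(unknown_answer)
        self.votes[unknown_id][guess] += 1
        # Once enough trusted users agree, promote the word to the solved set
        if self.votes[unknown_id][guess] >= self.votes_needed:
            self.solved[unknown_id] = guess
            del self.votes[unknown_id]
        return True


captcha = PairedWordCaptcha({"img-1": "morning"})
captcha.submit("img-1", "Morning", "img-2", "overlooks")   # passes, one vote
captcha.submit("img-1", "mourning", "img-2", "nonsense")   # fails, guess ignored
captcha.submit("img-1", "morning", "img-2", "overlooks")   # passes, word promoted
```

After the third call, the once-unknown word can itself serve as the control word for a new user, which is how the pool of solved text grows.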
The idea was compelling, particularly to one internet titan: in September 2009, Google bought reCaptcha. The purchase made sense. The company not only had a huge number of account creation requests, thanks to spammers trying to create Gmail accounts en masse, it also had a significant corpus of text to digitise, the result of its controversial plan to scan in millions of books and newspapers. Those incentives also meant Google could make reCaptcha free for other companies to use, with the server costs being recouped by the valuable data.
But even though reCaptcha made proving you’re a human useful, it couldn’t beat the progress of automatic text recognition. As early as 2008, the Captcha concept was already starting to fall behind. Not only were robots getting better at reading even distorted text, but spammers were starting to use reCaptcha’s concept against it: if humans can do work better than robots, why not get them to do the work? By offering up something for free (this being the internet, it’s usually porn), a spammer could often convince people to solve other sites’ Captchas for them, by just copying the image over.
Captchas have evolved in response, with Google introducing increasingly subtle technological tricks to try and tell whether a user is or isn’t a human. That culminated in 2014, when it introduced the “No Captcha reCaptcha”. The form looks like a simple box: tick it to confirm that you aren’t a robot.
Unlike text-based Captchas, the mechanisms by which Google tells whether it’s dealing with a robot were deliberately obscured. The company said it employed “advanced risk analysis” software, which monitors things like how the user types, where they move their mouse, where they click and how long it takes them to scan a page, all with the goal of working out which behaviours are human-like and which are too robotic.
That’s likely how the new Invisible reCaptcha works, although the company is even more tight-lipped about it. In response to a request for elaboration, Google only linked to a promotional video.
But the No Captcha reCaptcha didn’t mean the death of useful Captchas. Instead, they’ve evolved too, moving beyond text to help Google’s other big data projects.
If Google decides you aren’t human with its weird voodoo, it will now show you a collection of images and ask you to unwittingly train its machine-learning systems in various ways. Some users might be shown a grid full of animal pictures and be asked to select every cat (useful training for Google Photos’ ability to search through your pictures for keywords you provide); others might be shown a picture taken from a Street View car and asked to type in the door numbers of houses (useful for improving the accuracy of the company’s maps) or select every part of the image that contains road signs (useful for training the company’s self-driving cars). Still others might be shown a picture of a military helicopter and asked to select all the squares that contain a helicopter (useful training for … well, probably for image recognition, but maybe for Google’s plan to take over the world with AI).
Ultimately, though, Google’s plan to remove the burden of reCaptchas altogether means that it will get less and less of this information from end users. But given the company’s scale, even the people who fail the invisible reCaptcha might well provide enough extra data to give Google’s AI plans yet more of a boost against the competition. Who knows, maybe the Invisible Captcha is also training an AI how to act like a human online?
guardian.co.uk © Guardian News & Media Limited 2010