How does the “I’m not a robot” checkbox work?

Asking you to click a checkbox to confirm that you are, in fact, human seems curiously simple.

In today’s age, there’s a high chance that you, dear reader, are a machine. Maliciously-programmed internet bots (software applications that can run automated tasks) are an unfortunate commonplace on the internet. They can be used at various scales from generating fake social media accounts, to rapidly booking out all tickets for a popular concert and orchestrating a large-scale Distributed Denial of Service (DDoS) attack; a DDoS is an attempt to make an online service unavailable by overwhelming it with traffic. It’s the type of high-profile attack that can take down everything from banks to government websites.

A dystopian world like this needs a reliable way to differentiate an evil bot from a well-intentioned human. How can a banking website be sure that an innocent grandma who is logging in to check that the holiday gift money was successfully transferred to her grandchildren, is in fact, an innocent grandma? Enter, the “Completely Automated Public Turing test to tell Computers and Humans Apart”, or more simply, the CAPTCHA.

Just like internet bots themselves, and like much of the innovation on the internet, CAPTCHAs find their origin in the hacker community. Back in the ancient 1980s the hackers invented leetspeek to bypass security filtering on internet chat forums. Leet is a method of converting words to lookalike characters or abbreviations that cannot easily be interpreted by a computer:

leet > I33t
censored > c3n50red
porn (pornography) > pr0n

In the pre-Google days of the internet, websites would be manually submitted to search engines. In order to prevent the submission of fake websites, AltaVista implemented the first CAPTCHA-like system that required a user to type a series of distorted characters into a box. This approach, which we often still encounter when registering new accounts or submitting information on the internet, is based on three principles:

Humans can more easily recognise highly distorted, rotated or skewed characters.
Humans can more easily visually separate overlapped characters.
Humans can more easily draw on context to understand visually distorted characters, for example, identifying a character based on the full word that it appears in.

The search engine Alta Vista was one of the first popular websites that introduced a CAPTCHA-like protection when submitting new websites to its database.

In 2003, a research team from Carnegie Mellon University published a pioneering research paper that described many different types of software programs that could distinguish humans from computers. It was this group that also coined the catchy acronym. As CAPTCHAs became a status quo of security on the Internet, Luis von Ahn, a member of the original research team, became increasingly uncomfortable with how much valuable time was being wasted on solving these mini puzzles. In a wonderful 2011 TED Talk, von Ahn estimated that humanity as a whole was wasting 500,000 hours a day on completing CAPTCHAs.

Luis Von Ahn discusses how the collective amount of time wasted on filling out CAPTCHAs inspired the reCAPTCHA project.

Questioning whether this time could be put to more powerful and meaningful use, he developed reCAPTCHA, which was eventually sold to Google in 2009. These days, there are a number of projects and companies (including Google Books, the Internet Archive, Amazon Kindle and The New York Times) that are scanning and indexing large numbers of books, documents and images for use on the web. reCAPTCHA works by taking any of the scanned words that cannot be recognised and presenting them to a human alongside a known word for interpretation. By typing the known word correctly, you identify yourself as a human and the reCAPTCHA system gains some confidence that you have correctly digitised the second. If 10 other people agree on the transcription of the unknown word, the system will assume this to be correct. Today reCAPTCHA helps to digitise millions of books a year and has also extended to support other efforts like digitising street names and numbers on Google Maps or recognising common objects in photos for Google Images.

The original reCAPTCHA asks you to type a known scanned word to identify yourself as a human and to help transcribe another word that a computer was not able to recognise.

New forms of CAPTCHAs are also being used to help index images and data captured by Google Street View.

There are many other forms of CAPTCHAs, including an audio version for the visually impaired. But it is the curiously simple variety — the “I’m not a robot” checkbox seen on many of today’s websites — that inspired the original question behind this article. This checkbox, endearingly called the “no CAPTCHA reCAPTCHA”, is a Google product that unsurprisingly uses a combination of advanced Google technology to produce a very simple result. Google will analyse your behaviour before, during and after clicking the checkbox to determine whether you appear human. This analysis might include everything from your browsing history (malicious bots don’t necessarily watch a few YouTube videos and check their Gmail before signing up for a bank account), to the way you organically move your mouse on the page. If Google is still unsure of your humanness after clicking the checkbox, you will be shown a visual reCAPTCHA (with words, street signs or images) as an additional security measure. This multi-faceted approach is necessary as computers become more skilled at complex image recognition and with the rise of CAPTCHA sweatshopping (think a large room of underpaid workers tasked with generating a heap of fake social media accounts).