I Am Not A Robot : Planet Money

JACOB GOLDSTEIN, HOST:

In the year 2000, everybody was signing up for Yahoo email addresses. This was back before Gmail. And Yahoo Mail was great. It was free. You could check your email anywhere. But there was this one problem.

NOEL KING, HOST:

And a computer science grad student at Carnegie Mellon University became fixated on this problem. His name is Luis von Ahn.

LUIS VON AHN: The problem was that there were people who, in order to send spam from Yahoo accounts, they would obtain millions of email accounts.

GOLDSTEIN: Literally millions – not hundreds or thousands, but millions of fake email accounts.

VON AHN: Literally millions of fake email accounts.

KING: The spammers who were signing up for millions of spam accounts weren’t going to the Yahoo Mail page and just signing up for these accounts one at a time.

GOLDSTEIN: Yeah. It wasn’t like they were, like, [email address], [email address], [email address].

KING: No, spammers were writing simple computer programs – little bots that just kept filling out the Yahoo email sign-up form again and again and again, day and night. And that would generate an army of email accounts that could be used to sell fake Viagra or steal your bank account information – whatever.

GOLDSTEIN: Yahoo didn’t know what to do about this, but Luis von Ahn had an idea.

VON AHN: So the idea was can we make a test that can distinguish between humans and computers, but also a test that is graded by the computer.

GOLDSTEIN: If you’ve basically ever signed up for anything on the Internet, you probably know the idea that Luis von Ahn came up with – a picture of distorted letters and numbers, and then a little field below that picture where you type in the characters that you see.

VON AHN: And we actually showed it to the guy who was the chief scientist at Yahoo. He loved it. And within a few weeks, it was actually, you know, in the registration flow of email accounts at Yahoo. It was being used there, and we were super happy that they were just using it.

GOLDSTEIN: Luis gave his little test a name. It was a long, ridiculous name that made a short, genius acronym. The long name was…

VON AHN: Completely Automated Public Turing Test To Tell Computers And Humans Apart.

KING: A Turing test is a famous old idea in computer science. It’s a test where you try to tell if you’re chatting with a computer or with a human being. If a computer can consistently make you think it’s a human being, that is artificial intelligence.

GOLDSTEIN: And this sort of Turing test that Luis came up with – it became huge. You may even know the acronym for this test.

VON AHN: CAPTCHA.

GOLDSTEIN: CAPTCHA. CAPTCHA is very compelling. CAPTCHA is a good name.

VON AHN: Yeah, because it’s like capture them or gotcha or something like that. Yeah, it was a good name.

GOLDSTEIN: You know CAPTCHA. Maybe you do not like CAPTCHA. And yet, the 20-year history of CAPTCHA is this window into a lot – into artificial intelligence, into digitizing millions of books, also into a little cybercrime.

(SOUNDBITE OF LEIGH MCALLISTER GRACIE SONG, “LET’S START A MOVEMENT”)

GOLDSTEIN: Hello, and welcome to PLANET MONEY. I’m Jacob Goldstein.

KING: And I’m Noel King, and I am not a robot.

GOLDSTEIN: I am not, not, not a robot.

KING: But if you were…

GOLDSTEIN: That’s exactly what I would say.

KING: (Laughter).

GOLDSTEIN: Today on the show, a global, decades-long war.

KING: An Internet that people actually use versus a spammy wasteland.

GOLDSTEIN: It is computer versus computer. And in the end, the computers are only going to need us, the humans, to do a little light data entry.

(SOUNDBITE OF LEIGH MCALLISTER GRACIE SONG, “LET’S START A MOVEMENT”)

GOLDSTEIN: I should say that Luis von Ahn was one of several people working on CAPTCHA-like tests around the same time, so he’s not the only person who came up with the idea. Other people developed their own CAPTCHAs. But Luis and his colleagues are the ones who came up with the name, and their version was the one that really took off.

KING: When you bought tickets online, when you signed up for your Myspace account – pretty soon, people were taking Luis’ little test 200 million times a day. It was protecting the world from scalpers and spammers and bots. And the world, of course, responded with gratitude.

VON AHN: Every time that I talked to somebody about CAPTCHAs, you know, the first thing they would tell me is how annoying they are. So I started feeling, you know…

GOLDSTEIN: Fair. Fair.

VON AHN: (Laughter) I started feeling partly responsible for these, you know, 200 million times a day. And each time you type one of these, you waste about 10 seconds of your time. So you know, I started just thinking, is there any way in which we can make good use of these 10 seconds?

GOLDSTEIN: This was in the mid-2000s. And at this moment, there is this push going on to digitize old books and old documents. And at the time, it was easy enough to scan old pages, old pieces of paper and put them online. But computers were still bad at turning those scanned pages into useful online documents.

VON AHN: It’s not searchable. You cannot change the font size. You cannot – I mean, it’s just a bunch of kind of somewhat crappy pictures.

GOLDSTEIN: (Laughter) Yeah.

VON AHN: So it occurred to me that you could take all of the words that the computer could not recognize, and we could get people to read them for us while they were typing CAPTCHAs on the Internet.

KING: Up to this point, Luis has been giving CAPTCHA away for free. But now he thinks people might pay to have their print archives digitized one CAPTCHA at a time. And he is sitting on over half a million hours of free human labor a day. So he starts a company called reCAPTCHA. And he goes out looking for customers.
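The "half a million hours" figure follows from the two numbers von Ahn gives earlier: 200 million solves a day at about 10 seconds each. A minimal check of that arithmetic, using only the figures quoted in the episode:

```python
# Figures quoted in the episode.
SOLVES_PER_DAY = 200_000_000   # "200 million times a day"
SECONDS_PER_SOLVE = 10         # "about 10 seconds of your time"


def daily_human_hours(solves_per_day: int, seconds_per_solve: float) -> float:
    """Total human time spent typing CAPTCHAs per day, in hours."""
    return solves_per_day * seconds_per_solve / 3600


hours = daily_human_hours(SOLVES_PER_DAY, SECONDS_PER_SOLVE)
print(f"{hours:,.0f} hours per day")  # 555,556 hours per day
```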

VON AHN: And what happened was I was actually giving a talk somewhere, and I was fortunate that the – at the time, the guy who was the chief information officer for The New York Times was sitting in the audience.

GOLDSTEIN: OK.

VON AHN: And he said, you know what? We have this huge, you know, 130-year-old archive of old editions of The New York Times. So maybe we can – maybe you can help us.

KING: The New York Times ended up being reCAPTCHA’s first client. Now when you solved a CAPTCHA, next to a few random letters and numbers, there was also a picture of a word from an old issue of the Times that computers couldn’t read. When you typed in that word, you weren’t just protecting the Internet from spam. You were also helping to turn a hundred years of old newspapers into a searchable digital archive.
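The episode does not spell out how a typed word could be trusted when the computer itself did not know the answer. One plausible scheme, sketched below, grades only the part with a known answer and treats a passing user's reading of the scanned word as a vote, accepting the word once enough people agree. The function names, the word ID and the three-vote threshold are all hypothetical, not details from the episode.

```python
from collections import Counter, defaultdict
from typing import Optional

# Hypothetical threshold: accept a reading once this many humans agree on it.
VOTES_NEEDED = 3

# votes[word_id] counts the answers typed for each unreadable scanned word.
votes = defaultdict(Counter)


def submit(control_answer: str, control_typed: str,
           word_id: str, word_typed: str) -> bool:
    """Grade one solve and return True if the user passes as human.

    Only the control part is graded. The scanned word has no known answer,
    so a passing user's reading is simply recorded as a vote.
    """
    passed = control_typed.strip().lower() == control_answer.lower()
    if passed:
        votes[word_id][word_typed.strip().lower()] += 1
    return passed


def transcription(word_id: str) -> Optional[str]:
    """Return the agreed reading of a word, or None if there is no consensus yet."""
    if not votes[word_id]:
        return None
    answer, count = votes[word_id].most_common(1)[0]
    return answer if count >= VOTES_NEEDED else None


# Four people pass the control and read the same scanned word.
for typed in ["morning", "Morning", "mourning", "morning"]:
    submit("x7k2", "x7k2", "nyt-0001", typed)

# A failed control means the reading is thrown away.
submit("x7k2", "wrong", "nyt-0001", "bot guess")

print(transcription("nyt-0001"))  # morning
```

The point of the design is that the computer never needs to know the right answer for the scanned word. It only needs one question it can grade and enough independent humans to outvote the occasional typo.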

GOLDSTEIN: And I have to say, I just love this sort of – while you’re doing one thing, you’re also doing something else, like, efficiency of this. Like, it just – it delights me. You know, it’s like the old dream of riding your exercise bike to power the lights in your house or something, which, by the way, Luis told me when he was, like, 11, he had that dream. And then he, like, looked into it and realized, oh, actually, like, a person riding an exercise bike is a terrible way to generate power.

KING: As Luis was getting reCAPTCHA going, Google came out and announced that they were starting to digitize every book – like, every single book in the world or something. They saw what Luis was doing for the Times. And in 2009, Google bought reCAPTCHA and started using it to help digitize books.

GOLDSTEIN: And then a few years later, Google started using CAPTCHA tests that showed pictures of addresses on the sides of buildings. When we solved those CAPTCHAs, we were making Google Maps work better, doing a little more work for Google.

KING: Unpaid. So that is the end of Luis’ CAPTCHA story. But a little digression. He started a language company where people did online translation while they were learning the language.

GOLDSTEIN: Same, like, doing-one-thing-actually-doing-other-thing idea. This company became super popular. It’s called Duolingo. In fact, it got so popular that they got rid of the translation part. Now it’s just this app that millions of people use to learn languages.

KING: OK. End of Luis. End of digression.

So now it is the aughts. And for a while, CAPTCHA is working. The spammers are held at bay. And then someone figures out a workaround.

GOLDSTEIN: Shady businesses started showing up online and offering to break CAPTCHA for anybody willing to pay. Chris Kanich is a computer scientist who started looking into these businesses around 2008.

CHRIS KANICH: And it’s one of those things. Like, until you actually think, oh, actually, let’s go seek this out, see how hard it is to find, you might think, oh, this is some shady cybercriminal underground thing. But, nope. You can just Google for it. You can find a dozen of these services very competitively priced with, you know, all that stuff.

GOLDSTEIN: Chris wanted to know, like, what’s even going on here? Like, is it for real? Do these services work? So he and his colleagues decided to act like spammers. I mean, they didn’t do the spam part, but they did, more or less, everything else. They built a bot. And this bot went around the Web, bumping into CAPTCHAs. And automatically, every time the bot hit a CAPTCHA test, it would send it off to one of these services that offers to solve CAPTCHAs for money.

KING: And what do those services do exactly? They pay human beings to sit in front of computers all day long and solve one CAPTCHA after another. So some person sitting in front of a computer gets a CAPTCHA from Chris’ bot, solves the test in a couple seconds, sends it back to Chris’ bot, which enters the solution into a webpage. And boom.

GOLDSTEIN: Yes, that is the plain, vanilla version, just to see, like, does this work? How long does it take? But Chris and his colleagues also had some other questions they wanted to answer, so they did something else. They made up their own CAPTCHA tests to send out to the solvers. Some of those tests just said, what time is it? And the answer to those tests told them what time zone the people solving the test lived in.

They also wanted to know what languages the solvers spoke, so they made CAPTCHAs with weird instructions in lots of different languages.

KANICH: Chinese, Spanish, Italian, Tagalog, Portuguese, Russian, Tamil, Dutch, Hindi, German, Malay, Vietnamese, Korean, Greek, Arabic, Bengali, Kannada, Klingon and Farsi.

GOLDSTEIN: I’m sorry. What was the one before Farsi?

KANICH: Klingon.

GOLDSTEIN: Klingon, the made-up language from “Star Trek.” I mean, of course. No, but they did it because they wanted to just, like, sort of push these services. Like, how far will these CAPTCHA-solvers go?

KANICH: There’s no way they’re going to be able to answer this. But the answers we saw – it showed us – I think we got something like a 1% accuracy rate, but it was on something that was so incredibly long of a question that it couldn’t have been right by chance. So presumably, one of these CAPTCHA-solvers recognized that this was Klingon, either knew Klingon, just because that’s what some people know, or looked it up online and was actually able to successfully solve this CAPTCHA that was written in Klingon.

GOLDSTEIN: You found the greatest CAPTCHA-solver on Earth.

KANICH: Yes.

KING: Based on this part of the study – not the Klingon part, but the languages and the time zones – they figure out that a lot of people seem to be doing this work in Russia, China and India. And they realize this is a huge industry. People have started calling it CAPTCHA farming. And it is basically human beings opening the CAPTCHA gates for an army of bots. And CAPTCHA farms work.

KANICH: Usually, they were right. Usually, they were very fast.

GOLDSTEIN: So the services were legit. I mean, they were potentially illegal, but they did the thing that they said they were doing.

KANICH: Correct. Yeah. One of the most interesting things about cybercrime as a marketplace is that it works like any other, like, business-to-business-type marketplace. Your reputation is really important there. And you’re not going to keep your CAPTCHA-solving business in business unless you’re actually solving those CAPTCHAs.

GOLDSTEIN: And how much did it cost?

KANICH: One U.S. dollar per thousand CAPTCHAs solved, so incredibly, incredibly inexpensive. But this is a task that takes a typical human, you know, about 15, 20 seconds.

GOLDSTEIN: Yes. But God, I mean, you really feel for the people doing the work.

KANICH: Yes.

KING: So if you are a spammer, these workers will solve CAPTCHAs for you around-the-clock for a tenth of a penny per CAPTCHA.

GOLDSTEIN: That price is obviously, like, mind-breakingly low. But it is still not zero. And it is still enough to weed out a lot of people. It weeds out, you know, people who are just trolls making spam accounts for fun and people who are just posting garbage comments on a million garbage websites hoping to sell a few extra dollars’ worth of garbage products. So even with the CAPTCHA farms, CAPTCHAs still are working to block a lot of people.
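To put Kanich's numbers side by side: one dollar per thousand works out to a tenth of a cent per CAPTCHA, and at 15 to 20 seconds per solve a single person can only get through a few hundred an hour. The sketch below assumes, as an upper bound that the episode does not claim, that the whole price reaches the worker and that the worker solves back to back with no idle time.

```python
PRICE_PER_THOUSAND_USD = 1.00   # "One U.S. dollar per thousand CAPTCHAs solved"
SECONDS_PER_SOLVE = (15, 20)    # "about 15, 20 seconds"

price_per_captcha = PRICE_PER_THOUSAND_USD / 1000


def revenue_per_worker_hour(seconds_per_solve: float) -> float:
    """Dollars billed per hour of one person's solving time (upper bound on pay)."""
    solves_per_hour = 3600 / seconds_per_solve
    return solves_per_hour * price_per_captcha


print(f"price per CAPTCHA: ${price_per_captcha:.3f}")  # $0.001
for seconds in SECONDS_PER_SOLVE:
    print(f"{seconds}s per solve: {3600 / seconds:.0f} solves/hour, "
          f"${revenue_per_worker_hour(seconds):.2f} per worker-hour")
```

Under those assumptions the billed rate comes to roughly 18 to 24 cents per hour of human effort, before the service takes any cut.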

KING: But what would happen if you could teach a computer to solve the CAPTCHAs? Then you wouldn’t need the farms and the farmers anymore. The price would go to zero, and the spammers could go wild.

GOLDSTEIN: So this thing kept happening for a long time where some spammer would program a computer to solve CAPTCHAs. Then the people making CAPTCHAs would make the tests harder. And eventually, the tests got too hard for human beings to pass. So in 2014, Google said, all right, enough; we are going to completely change the way CAPTCHAs work. They called it reCAPTCHA version 2.

KING: And there are two big changes here. First of all, they introduce the I-am-not-a-robot check box.

GOLDSTEIN: Which I have always been kind of mystified by ’cause I always figured, like, couldn’t a robot just check a check box? But what I learned when I was working on this show is when you click on that check box, the site sends back a bunch of information to Google. We’ll talk more about what that information is a little later in the show. But Google takes that information and decides whether it thinks you’re a person or a robot. If it’s pretty confident that you’re a person, you just get to keep going on about your business. But if it thinks you might be a machine, then you have to take a test.

KING: And the test is the second thing that’s really different in this new reCAPTCHA. This is when they start using those tests that said, click on the picture of a cat, or, click on all of the pictures of a car. And the reason they made this is it is much, much harder for a computer to beat than just distorted pictures of numbers and letters.

GOLDSTEIN: I talked about this with Jason Polakis. He’s a computer science professor who has studied CAPTCHA.

JASON POLAKIS: So an example is that you want a system where you give it a picture of a car and it’s going to tell you, this is a car. But the problem is that you can have a hundred different types of cars – different angles, different colors, different shapes, different models. So expand from cars to trucks to boats to planes to trees to dogs to animals to humans – everything you can imagine.

GOLDSTEIN: Just even – you know, if we just go back to the beginning of CAPTCHA – right? – in around 2000-ish, was there any system in the world that could’ve looked at a picture of a glass of wine or a picture of a dog and said, that’s a glass of wine, or, that’s a dog?

POLAKIS: No, not even close.

KING: In order to get a machine to solve this kind of problem, you would need millions of labeled digital photos of different things and incredibly powerful computers and very, very clever software that could take all the photos and learn to recognize cats and wine and dogs.

GOLDSTEIN: As it happened, all of those things happened to come together in the years right before reCAPTCHA version 2 came out. You know, the Web provided millions of labeled images, and then computer scientists started using this technique called deep learning, or machine learning, that, essentially, trained powerful computers to see.

KING: Deep learning or machine learning – this is maybe the biggest behind-the-scenes technological innovation of the last 20 years. In fact, today, a lot of the time when people talk about AI – artificial intelligence – this is what they’re talking about. It’s essential to the progress of a lot of things. Take autonomous cars. This is how a car can see a kid running into the street and hit the brakes.

And it’s not just images. Researchers have used other data sets to teach computers to recognize fraud, to translate languages, to understand human speech and, like, a million other things.

GOLDSTEIN: So when this new reCAPTCHA version 2 came out, Polakis immediately thought, I want to see if I can use these new machine learning systems to break this new reCAPTCHA. And Polakis is not a machine learning expert. And that is actually sort of the point here because by this time, machine learning had progressed so much that you didn’t have to be an expert to use it. There were these basically off-the-shelf machine learning systems that anybody with, like, a medium amount of technical knowledge could use.

So Polakis and his colleagues took these systems and sort of threw them up against reCAPTCHA – tried to break reCAPTCHA with these systems. And it worked.

POLAKIS: Now, I’m not a machine learning person, so the fact that machines have reached that point and are that good at tasks, to me, is mind blowing.

KING: Polakis publishes a paper about his project in 2016, and other researchers are doing the same thing. We beat reCAPTCHA. We beat reCAPTCHA.

GOLDSTEIN: And, of course, Google knows this is happening. In fact, Google itself is doing all this work in machine learning. They know better than anybody in the world, maybe, that machine learning is just going to keep getting better and better at solving reCAPTCHAs.

KING: Aaron Malenfant runs the reCAPTCHA engineering team at Google. So if this is anyone’s problem, it is his problem. One day a couple years ago, he was telling his boss about all the things that his team was trying to do to stay ahead of the bots. But eventually, he had to break it to her. Pretty soon, the bots are going to be able to solve any kind of CAPTCHA challenges – any tests.

AARON MALENFANT: I mentioned to her that in the next three to five years, the current challenges are no longer going to be working. We need to move to a new system. And so she turned to me and just asked me, why aren’t you doing that?

GOLDSTEIN: So he did. Aaron and his team started building a new kind of reCAPTCHA, where there is no test at all – no letters, no pictures, no street signs. Apart from a little notice at the bottom of a webpage, you don’t even know reCAPTCHA is there.

KING: Google just released it last year, so websites are still in the process of switching over. It’s called reCAPTCHA v3 – version 3, reasonably enough. The way it works is you go to a website, and Google analyzes your behavior and then sends a score to the website. And that score is the probability that you are a human being. And then it’s up to the website what to do. Maybe they make you log in again. If you’re submitting a comment or a review, maybe they send it to moderation.
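What a website does with that score might look like the sketch below. It assumes the site already has a score between 0.0 and 1.0, with higher meaning more likely human; it makes no network calls, and the cutoffs and action names are hypothetical values that each site would choose for itself.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    MODERATE = "send to moderation"
    REAUTHENTICATE = "ask the user to log in again"


# Hypothetical cutoffs; these are not values from the episode or from Google.
ALLOW_AT_OR_ABOVE = 0.7
MODERATE_AT_OR_ABOVE = 0.3


def decide(score: float) -> Action:
    """Map a human-likelihood score to what the site does next."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= ALLOW_AT_OR_ABOVE:
        return Action.ALLOW
    if score >= MODERATE_AT_OR_ABOVE:
        return Action.MODERATE
    return Action.REAUTHENTICATE


for score in (0.9, 0.5, 0.1):
    print(score, "->", decide(score).value)
```

The design choice worth noticing is that the score is advice, not a verdict. The same 0.5 could mean moderation on a comment form and an extra login step on a checkout page.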

GOLDSTEIN: So that’s, like, the steps of what happens. But there is this black box – right? – which is what’s going on in the background. How is Google deciding whether or not I’m a robot? And we know a little bit about how this worked in version 2 when you clicked the check box. In that case, Google looked at whether you had visited Google sites, whether you were logged in to a Google account. And also – and this part was almost entirely secret – Google did some kind of analysis based on what you did on that page. Like, you know, maybe it had to do with how you moved the mouse or something.

So like, this black box is, of course, you know, the most interesting thing to me. So when I was talking to Aaron about this new reCAPTCHA – about version 3, I just said, like, what can you tell me about what you’re doing to decide whether I’m a human? So what’s public about the way in which you’re making that risk assessment?

MALENFANT: Yeah. So the reason we don’t say too much is that we do have adversaries trying to beat us at all times. We do say publicly that we adapt to a particular site and behavior for that site. And when I say we, I don’t mean people. I mean the machine looks at it and decides, is this request expected for this website? Does it look like other good requests to this website?

KING: In other words, for every site that installs reCAPTCHA version 3, Google’s computers, using machine learning, analyze specific behavior for that website. And then every time a user comes to that website, they say, is this user doing what a person normally does on this site, or are they acting weird?

GOLDSTEIN: And this is a big change because, you know, in the sort of reCAPTCHA world that we’re used to, there’s, like, a few different kinds of tests. And if you could build a bot that could pass those tests, then you could get through any reCAPTCHA on any website. It was like you had the keys to all the locks. Now with version 3, it’s like Google is making a different lock for every single website.

MALENFANT: This means that if – just because you can get a good score on one website if you’re an attacker doesn’t mean that you can get a good score on all the websites.

GOLDSTEIN: OK. I get that. That makes sense. I mean, if that’s all true, does it matter at all in v3 whether I am logged in to Google or have a Gmail account or any of that? Is that even relevant for how you analyze the risk?

MALENFANT: I would say it matters a lot less.

GOLDSTEIN: A lot less than what?

MALENFANT: Than it used to.

GOLDSTEIN: OK. OK.

MALENFANT: Our goal is that it doesn’t matter.

GOLDSTEIN: That’s your endgame.

MALENFANT: Right, because we do want to…

GOLDSTEIN: We’re not there yet. You’re not there yet, but you’re getting there.

MALENFANT: Well, I don’t like to – I don’t even want to – I probably already said too much.

GOLDSTEIN: A few days after I talked to Aaron, a Google spokesman got in touch with me and was able to tell me a little bit more about this. What he said was whether or not a user has an active Google account – that matters to Google’s risk analysis. But reCAPTCHA does not use any information about what the user does on his or her account. So they’re not looking at, like, how often you send emails or browse YouTube. It’s just, do you have an account – yes or no?

KING: For all of the secrecy and the cleverness in this new kind of CAPTCHA, it’s still a computer deciding who is a human.

GOLDSTEIN: But now the CAPTCHA test isn’t just in a little box at the bottom of the page. The test is everything you do on the page. So when the spammers’ computers figure out how to break this reCAPTCHA – and if there’s one thing we learned it’s that somebody somewhere will break it sometime – the computers won’t just be solving a little test. The spammers’ computers will be imitating everything we do.

KING: End of file.

(SOUNDBITE OF LEIGH MCALLISTER GRACIE’S “NUMBERS GAME”)

GOLDSTEIN: PLANET MONEY has a new newsletter. Every week, it dives into economic questions like, why did a bunch of countries in Europe have a wealth tax and then get rid of it? And do plastic bag bans really help the environment? Sign up for the newsletter at npr.org/planetmoneynewsletter. Again, npr.org/planetmoneynewsletter.

Today’s show was produced by Alexi Horowitz-Ghazi and Liza Yeager. It was edited by Bryant Urstadt. PLANET MONEY’s senior producer is Alex Goldmark. I’m Jacob Goldstein.

KING: And I’m Noel King. Thanks for listening.

(SOUNDBITE OF LEIGH MCALLISTER GRACIE’S “NUMBERS GAME”)

GOLDSTEIN: I have failed many, many CAPTCHA tests in my life. And so I was – I think I’m going to say comforted when our producer Alexi found these videos on YouTube of people losing their minds over CAPTCHA tests.

(SOUNDBITE OF MONTAGE)

UNIDENTIFIED PERSON #1: OK, so apparently, I am incapable of identifying street signs in CAPTCHAs. Now, that’s not a street – is that a street sign? It’s on the street, and it’s a sign.

UNIDENTIFIED PERSON #2: Is that a bus? I think it’s a truck, but it could be a bus. Bus. Bus. Bus.

UNIDENTIFIED PERSON #3: Are there any more cars here? There aren’t, right? Right? Wrong.

UNIDENTIFIED PERSON #4: And now Google throws a foe at you that you’ve never seen before – roads.

UNIDENTIFIED PERSON #2: Is that a store front?

UNIDENTIFIED PERSON #1: These are incredibly low resolution.

UNIDENTIFIED PERSON #5: Did someone drunk make this CAPTCHA?

UNIDENTIFIED PERSON #2: And is this part of the traffic light? And if this is, is that?

UNIDENTIFIED PERSON #1: Apparently, I’m just wrong. I’m not a robot. I’m not.

NPR transcripts are created on a rush deadline by an NPR contractor. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.