Thank You for Being Late
However, in the time-honored tradition of programming engineers, Google, proud of what it had built, decided to share the basics with the public. And so it published two papers outlining in a general way the two key programs that enabled it to amass and search so much data at once. One paper, published in October 2003, outlined GFS, or Google File System. This was a system for managing and accessing huge amounts of data stored in clusters of cheap commodity computer hard drives. Because of Google’s aspiration to organize all the world’s information, it required petabytes and eventually exabytes (each of which is approximately one quintillion—1,000,000,000,000,000,000—bytes of data) to be stored and accessed.
And that required Google’s second innovation: Google MapReduce, which was released in December 2004. Google described it as “a programming model and an associated implementation for processing and generating large data sets … Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The … system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.” In plain language, Google’s two design innovations meant we could suddenly store more data than we ever imagined and could use software applications to explore that mountain of data with an ease we never imagined.
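To make that programming model concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the task. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for illustration. This is neither Google's code nor Hadoop's API; a real system would spread each phase across many machines and recover from machine failures.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or chunk of one) would be mapped on a
    # different machine; here everything runs in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data is everywhere"]
    print(word_count(docs))
```

The programmer writes only the map and reduce functions. Partitioning, scheduling, and communication, which the paper says the system takes care of, are stood in for here by the plain `shuffle` helper.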
In the computing/search world, Google’s decision to share these two basic designs—but not the actual proprietary code of its GFS and MapReduce solutions—with the wider computing community was a very, very, very big deal. Google was, in effect, inviting the open-source community to build on its insights. Together these two papers formed the killer combination that has enabled big data to change nearly every industry. They also propelled Hadoop.
“Google described a way to easily harness lots of affordable computers,” said Cutting. “They did not give us the running source code, but they gave us enough information that a skilled person could reimplement it and maybe improve on it.” And that is precisely what Hadoop did. Its algorithms made hundreds of thousands of computers act like one giant computer. So anyone could just go out and buy commodity hardware in bulk and storage in bulk, run it all on Hadoop, and presto, do computation in bulk that produced really fine-grained insights.
Soon enough, Facebook and Twitter and LinkedIn all started building on Hadoop. And that’s why they all emerged together in 2007! It made perfect sense. They had big amounts of data streaming through their business, but they knew that they were not making the best use of it. They couldn’t. They had the money to buy hard drives for storage, but not the tools to get the most out of those hard drives, explained Cutting. Yahoo and Google wanted to capture Web pages and analyze them so people could search them—a valuable goal—but search became even more effective when companies such as Yahoo or LinkedIn or Facebook could see and store every click made on a Web page, to understand exactly what users were doing. Clicks could already be recorded, but until Hadoop came along no one besides Google could do much with the data.
“With Hadoop they could store all that data in one place and sort it by user and by time and all of a sudden they could see what every user was doing over time,” said Cutting. “They could learn what part of a site was leading people to another. Yahoo would log not only when you clicked on a page but also everything on that page that could be clicked on. Then they could see what you did click on and did not click on but skipped, depending on what it said and depending on where it was on the page. This gave us big data analytics: when you can see more, you can understand more, and if you can understand more, you can make better decisions rather than blind guesses. And so data tied to analytics gives us better vision. Hadoop let people outside of Google realize and experience that, and that then inspired them to write more programs around Hadoop and start this virtuous escalation of capabilities.”
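The kind of analysis Cutting describes can be sketched in a few lines: gather click records, sort them by user and by time, and read off each user's path through a site. The record layout (`user`, `timestamp`, `page`) and the helper names are hypothetical, chosen only for illustration. A real Hadoop job would run over files spread across many machines rather than a list held in memory.

```python
from itertools import groupby
from operator import itemgetter


def paths_by_user(clicks):
    """Return {user: [pages in time order]} from unordered click records."""
    ordered = sorted(clicks, key=itemgetter("user", "timestamp"))
    return {
        user: [click["page"] for click in group]
        for user, group in groupby(ordered, key=itemgetter("user"))
    }


def transitions(paths):
    """Count how often one page led directly to another."""
    counts = {}
    for pages in paths.values():
        for src, dst in zip(pages, pages[1:]):
            counts[(src, dst)] = counts.get((src, dst), 0) + 1
    return counts


if __name__ == "__main__":
    sample = [
        {"user": "u2", "timestamp": 5, "page": "home"},
        {"user": "u1", "timestamp": 2, "page": "sports"},
        {"user": "u1", "timestamp": 1, "page": "home"},
        {"user": "u2", "timestamp": 9, "page": "sports"},
        {"user": "u1", "timestamp": 3, "page": "finance"},
    ]
    print(transitions(paths_by_user(sample)))
```

Counting page-to-page transitions is one simple way to see "what part of a site was leading people to another."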
So now you have Google’s system, which is a proprietary closed-source system that runs only in Google’s data centers and that people use for everything from basic search to facial identification, spelling correction, translation, and image recognition, and you have Hadoop’s system, which is open source and run by everyone else, leveraging millions of cheap servers to do big data analytics. Today tech giants such as IBM and Oracle have standardized on Hadoop and contribute to its open-source community. And since there is so much less friction on an open-source platform, and so many more minds working on it—compared with a proprietary system—it has expanded lightning fast.
Hadoop scaled big data thanks to another critical development as well: the transformation of unstructured data.
Before Hadoop, most big companies paid little attention to unstructured data. Instead, they relied on Oracle SQL—a computer language that came out of IBM in the seventies—to store, manage, and query massive amounts of structured data and spreadsheets. “SQL” stood for “Structured Query Language.” In a structured database the software tells you what each piece of data is. In a bank system it tells you “this is a check,” “this is a transaction,” “this is a balance.” They are all in a structure so the software can quickly find your latest check deposit.
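A small example shows what it means for the software to know what each piece of data is. This sketch uses Python's built-in `sqlite3` module with an in-memory database. The table name `transactions`, its columns, and the sample rows are invented for illustration and do not come from any real bank system.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few made-up bank transactions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "account_id INTEGER, kind TEXT, amount REAL, posted_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?, ?)",
        [
            (1, "check_deposit", 250.0, "2016-01-05"),
            (1, "withdrawal", 40.0, "2016-01-09"),
            (1, "check_deposit", 900.0, "2016-02-11"),
            (2, "check_deposit", 75.0, "2016-03-01"),
        ],
    )
    return conn


def latest_check_deposit(conn, account_id):
    """Return (posted_on, amount) of the most recent check deposit, or None."""
    return conn.execute(
        """
        SELECT posted_on, amount
        FROM transactions
        WHERE account_id = ? AND kind = 'check_deposit'
        ORDER BY posted_on DESC
        LIMIT 1
        """,
        (account_id,),
    ).fetchone()


if __name__ == "__main__":
    print(latest_check_deposit(build_demo_db(), 1))
```

Because every row is labeled with its kind and date, the query can go straight to the latest check deposit without scanning free-form text.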
Unstructured data was anything you could not query with SQL. Unstructured data was a mess. It meant you just vacuumed up everything out there that you could digitize and store, without any particular structure. But Hadoop enabled data analysts to search all that unstructured data and find the patterns. This ability to sift mountains of unstructured data, without necessarily knowing what you were looking at, and be able to query it and get answers back and identify patterns was a profound breakthrough.
As Cutting put it, Hadoop came along and told users: “Give me your digits structured and unstructured and we will make sense of them. So, for instance, a credit card company like Visa was constantly searching for fraud, and it had software that could query a thirty- or sixty-day window, but it could not afford to go beyond that. Hadoop brought a scale that was not there before. Once Visa installed Hadoop it could query four or five years and it suddenly found the biggest fraud pattern it ever found by having a longer window. Hadoop enabled the same tools that people already knew how to use to be used at a scale and affordability that did not exist before.”
That is why Hadoop is now the main operating system for data analytics supporting both structured and unstructured data. We used to throw away data because it was too costly to store, especially unstructured data. Now that we can store it all and find patterns in it, everything is worth vacuuming up and saving. “If you look at the quantity of data that people are creating and connecting to and the new software tools for analyzing it—they’re all growing at least exponentially,” said Cutting.
Before, small was fast but irrelevant, and big had economies of scale and of efficiency—but was not agile, explained John Donovan of AT&T. “What if we can now take massive scale and turn it into agility?” he asked. In the past, “with large scale you miss out on agility, personalization, and customization, but big data now allows you all three.” It allows you to go from a million interactions that were impersonal, massive, and unactionable to a million individual solutions, by taking each pile of data and leveraging it, combing it, and defining it with software.
This is no small matter. As Sebastian Thrun, the founder of Udacity and one of the pioneers of massive open online courses (MOOCs) when he was a professor at Stanford, observed in an interview in the November/December 2013 issue of Foreign Affairs:
With the advent of digital information, the recording, storage, and dissemination of information has become practically free. The previous time there was such a significant change in the cost structure for the dissemination of information was when the book became popular. Printing was invented in the fifteenth century, became popular a few centuries later, and had a huge impact in that we were able to move cultural knowledge from the human brain into a printed form. We have the same sort of revolution happening right now, on steroids, and it is affecting every dimension of human life.
And we’re just at the end of the beginning. Hadoop came about because Moore’s law made the hardware storage chips cheaper, because Google had the self-confidence to share some of its core insights and to dare the open-source community to see if they could catch up and leapfrog—and because the open-source community, via Hadoop, rose to the challenge. Hadoop’s open-source stack was never a pure clone of Google’s, and by today it has diverged in many creative ways. As Cutting put it: “Ideas are important, but implementations that bring them to the public are just as important. Xerox PARC largely invented the graphical user interface, with windows and a mouse, the networked workstation, laser printing, et cetera. But it took Apple and Microsoft’s much more marketable implementations for these ideas to change the world.”
And that is the story of how Hadoop gave us the big data revolution—with help from Google, which, ironically, is looking to offer its big data tools to the public as a business now that Hadoop has leveraged them to forge this whole new industry.
“Google is living a few years in the future,” Cutting concluded, “and they send us letters from the future in these papers and we are all following along and they are also now following us and it’s all beginning to be two-way.”
Software: Making Complexity Invisible
It is impossible to talk about the acceleration in the development and diffusion of software without talking about the singular contribution of Bill Gates and his cofounder of Microsoft, Paul Allen. Software had been around for a long time before Bill Gates. It’s just that the users of computers never really noticed, because it came loaded into the computer you bought, a kind of necessary evil with all that gleaming hardware. Messrs. Gates and Allen changed all of that, starting in the 1970s, with their first adventures in writing an interpreter for a programming language called BASIC and then the operating system DOS.
Back in the day, hardware companies mostly contracted out or produced their own software, with each running its own operating system and proprietary applications on its own machines. Gates believed that if you had a common software system that could run on all kinds of different machines—which would one day be Acer, Dell, IBM, and hundreds of others—the software itself would have value and not just be something that was given away with the hardware. It is hard to remember today what a radical idea this was then. But Microsoft was born on this proposition—that people should not just pay one time for the software to be developed as part of a machine; rather, each individual user should pay to have the capabilities of each software program. What the DOS operating system did, in essence, was abstract away the differences in hardware between every computer. It didn’t matter if you bought a Dell, an Acer, or an IBM. They all suddenly had the same operating system. This made desktop and laptop computers into commodities—the last thing their manufacturers wanted. Value then shifted to whatever differentiated software you could write that would work on top of DOS—and that you could charge each individual to use. That was how Microsoft got very rich.
We now take software so much for granted that we forget what it actually does. “What is the business of software?” asks Craig Mundie, who for many years worked alongside Gates as Microsoft’s chief of research and strategy and has been my mentor on all things software and hardware. “Software is this magical thing that takes each emerging form of complexity and abstracts it away. That creates the new baseline that the person looking to solve the next problem just starts with, avoiding the need to master the underlying complexity themselves. You just get to start at that new layer and add your value. Every time you move the baseline up, people invent new stuff, and the compounding effect of that has resulted in software now abstracting complexity everywhere.”
Think for a second about a software application such as Google Photos. Today it can pretty much recognize everything in every photograph that you’ve ever stored on your computer. Twenty years ago, if your spouse said to you, “Honey, find me some photos of our vacation on the beach in Florida,” you would have to manually go through photo album after photo album, and shoe box after shoe box, to find them. Then photography became digital and you were able to upload all your photos online. Today, Google Photos backs up all your digital photos, organizes them, labels them, and, using recognition software, enables you to find any beach scene you’re looking for with a few clicks or gestures, or maybe even by just describing it verbally. In other words, the software has abstracted away all the complexity in that sorting and retrieval process and reduced it to a few keystrokes or touches or voice commands.
Think for another second about what it was like to catch a taxi five years ago. “Taxi, taxi,” you shouted from the curb, perhaps standing in the rain, as taxi after taxi whizzed by with passengers already inside. So you then called the taxi company from a nearby phone booth, or maybe a cell phone, and, after keeping you on hold for five minutes, they told you that it would be a twenty-minute wait—and you didn’t believe what they said and neither did they. Today, we all know how different that is: all the complexity associated with calling, locating, scheduling, dispatching, and paying for and even rating the driver of your taxi has been abstracted away—hidden, layer by layer—and now reduced to a couple of touches of the Uber app on your smartphone.
The history of computers and software, explains Mundie, “is really the history of abstracting away more and more complexity through combinations of hardware and software.” What enables application developers to perform that magic are APIs, or application programming interfaces. APIs are the actual programming commands by which computers fulfill your every wish. If you want the application you’re writing to have a “save” button so that when you touch it your file is stored in the flash drive, you create that with a set of APIs—the same with “create file,” “open file,” “send file,” and on and on.
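As a rough illustration of the idea, the sketch below builds a "save" action and an "open" action on top of file-handling calls from Python's standard library. The function names `save_document` and `open_document` are hypothetical. The point is only that the application author calls an interface and never deals with how bytes are laid out on the drive.

```python
import tempfile
from pathlib import Path


def save_document(path, text):
    """What a 'save' button might call: write the text, return bytes written."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path.stat().st_size


def open_document(path):
    """What an 'open file' command might call: read the text back."""
    return Path(path).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Use a throwaway folder so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as folder:
        target = Path(folder) / "notes.txt"
        save_document(target, "hello")
        print(open_document(target))
```

Each call hides a stack of lower layers (the file system, the device driver, the storage hardware), which is the layering Mundie describes.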
Today, APIs from many different developers, websites, and systems have become much more seamlessly interactive; companies share many of their APIs with one another so developers can design applications and services that can interface with and operate on one another’s platforms. So I might use Amazon’s APIs to enable people to buy books there by clicking items on my own website, ThomasLFriedman.com.
“APIs make possible a sprawling array of Web-service ‘mashups,’ in which developers mix and match APIs from the likes of Google or Facebook or Twitter to create entirely new apps and services,” explains the developer website ReadWrite.com. “In many ways, the widespread availability of APIs for major services is what’s made the modern Web experience possible. When you search for nearby restaurants in the Yelp app for Android, for instance, it will plot their locations on Google Maps instead of creating its own maps,” by interfacing with the Google Maps API.
This type of integration is called “seamless,” explains Mundie, “since the user never notices when software functions are handed from one underlying Web service to another … APIs, layer by layer, hide the complexity of what is being run inside an individual computer—and the transport protocols and messaging formats hide the complexity of melding all of this together horizontally into a network.” And this vertical stack and these horizontal interconnections create the experiences you enjoy every day on your computer, tablet, or phone. Microsoft’s cloud, Hewlett Packard Enterprise, not to mention the services of Facebook, Twitter, Google, Uber, Airbnb, Skype, Amazon, TripAdvisor, Yelp, Tinder, or NYTimes.com—they are all the product of thousands of vertical and horizontal APIs and protocols running on millions of machines talking back and forth across the network.
Software production is accelerating even faster now, and not only because tools for writing software are improving at an exponential rate. These tools are also enabling more and more people within and between companies to collaborate to write ever more complex software and API code to abstract away ever more complex tasks. So now you don’t just have a million smart people writing code; you have a million smart people working together to write all that code.
And that brings us to GitHub, one of today’s most cutting-edge software generators. GitHub is the most popular platform for fostering collaborative efforts to create software. These efforts can take any form—individuals with other individuals, closed groups within companies, or wide-open open source. It has exploded in usage since 2007. Again, on the assumption that all of us are smarter than one of us, more and more individuals and companies are now relying on the GitHub platform. It enables them to learn quicker by being able to take advantage of the best collaborative software creations that are already out there for any aspect of commerce, and then to build on them with collaborative teams that draw on brainpower both inside and outside of their companies.
GitHub today is being used by more than twelve million programmers to write, improve, simplify, store, and share software applications and is growing rapidly—it added a million users between my first interview there in early 2015 and my last in early 2016.
Imagine a place that is a cross between Wikipedia and Amazon—just for software: You go online to the GitHub library and pick out the software that you need right off the shelf—for, say, an inventory management system or a credit card processing system or a human resources management system or a video game engine or a drone-controlling system or a robotic management system. You then download it onto your company’s computer or your own, you adapt it for your specific needs, you or your software engineers improve it in some respects, and then you upload your improvements back into GitHub’s digital library so the next person can use this new, improved version. Now imagine that the best programmers in the world from everywhere—either working for companies or just looking for a little recognition—are all doing the same thing. You end up with a virtuous cycle for the rapid learning and improving of software programs that drives innovation faster and faster.