So, Predicting the Future. Part of the spirit of this class is to introduce you to new language, new vocabulary, new terminology that you can take with you out into the working world, to better prepare you for job interviews and your first day on the job, first week on the job, first year on the job. So, there's an area called Prognostic Health Management, abbreviated PHM. Prognostic means an advance indication or portent of a future event. The design, production, and operation of machinery around the world is a multi-trillion dollar global industry, and this Prognostic Health Management notion has developed over the last couple of decades. There's a link for you, prognostichealthmanagementsociety.org, and you can go up and read all about Prognostic Health Management as you see fit. But the central idea is to service machines in a production line when they need it, and not according to some conservative time-based schedule.

So, preventive maintenance for years was done on a time-based schedule, and this applies to things like the power plant example we saw in the first week and to production machinery: lathes and stamping machines, and the welding machines and robots that the automotive business uses to assemble cars, and so forth. For years, a company would buy a piece of machinery and it came with a service manual and a service schedule, just like changing the oil in your car: every 3,000 miles, we are supposed to change the oil. So, servicing these machines worked the same way: these bearings need to be replaced every three months, whether they really need it or not, this hydraulic piston needs to be replaced once a year, et cetera. So, there's this notion of a fixed interval. By using and studying the data from the sensors monitoring these devices, they found they could predict when a piece of equipment was going to fail. Now, that's a huge deal to manufacturing businesses because it reduces their operating costs. They are not replacing the bearings on a machine every three months like they used to; they are replacing the bearings when the predictive analytics says, "Hey, the vibration on this bearing is starting to go up. What we've seen in the past is, once it crosses this threshold, within two weeks it's going to fail." So, shut the machine down, go in and change that bearing, bring the machine back up again, and off you go. That might be every six months or every nine months instead of every three months, and so it's saving the company money. That's the idea behind Prognostic Health Management.

In 2015, GM announced the deployment of this technology, which aims to determine whether certain vehicle components need service and maintenance because of possible future failure. The data sources that went into their Prognostic Health Management were manufacturing data, engineering data, all their internal data from all of the testing that they did: stress tests. I talked about thermal stress testing, but there are other types of testing, like mechanical shock testing and vibration testing and so forth. So, manufacturers generate lots of internally generated data over time. Then there's field data, anything gathered from vehicles: oil temperature, manifold pressure, vibration, acoustic noise. All these sources of data could feed into a health management system to predict a failing component in an automobile.
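Just to make the threshold idea concrete, here's a minimal sketch, not GM's actual system; the vibration readings, the threshold value, and the two-week rule are all made up for illustration. It fits a straight line to recent sensor readings and estimates when the trend will cross the failure threshold:

```python
import numpy as np

# Hypothetical daily vibration readings from a bearing sensor (made-up data).
days = np.arange(10)
vibration = np.array([2.1, 2.2, 2.2, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3])

FAILURE_THRESHOLD = 5.0  # made-up level at which failure has historically followed

# Fit a straight line (simple linear regression) to the recent trend.
slope, intercept = np.polyfit(days, vibration, 1)

if slope > 0:
    days_to_threshold = (FAILURE_THRESHOLD - vibration[-1]) / slope
    print(f"Rising trend; roughly {days_to_threshold:.1f} days until the threshold.")
    if days_to_threshold < 14:
        print("Schedule maintenance now instead of waiting for the fixed interval.")
else:
    print("No rising trend; keep monitoring.")
```

The point is just that the maintenance decision is driven by what the sensor data is doing, not by the calendar.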
They used linear regression, they used nonlinear regression, which we didn't talk about, and they used support vector machines. It was successful. I probably should have included some data here to show how successful it was, but my understanding is that it was successful.

So, moving along, this is a quick look at Hadoop, or at least where it was a year ago. At the bottom of this is the Hadoop Distributed File System, or HDFS, and then there's this thing called YARN that sits on top of it, and here's the MapReduce process, and then there's the data processing portion of Spark, and then there's the machine learning library of Spark to perform predictive analytics. YARN stands for Yet Another Resource Negotiator. It's relatively new, and it's responsible for servicing client requests by allocating processes, which in their language are called containers, on physical machines, and managing those containers. MapReduce is a two-step process: the first step is called Map and the second is called Reduce. The Map function runs on a master node and divides a query request into subtasks, which are then distributed to worker nodes; each worker processes a subtask and passes its results back to the master node. The Reduce function collects the results of all the subtasks and combines them to produce an aggregate result as the answer to the original query that was made to the system.

I've got an example here. Let's say we want to compute the number of social friends that Bob has and arrange them by geographical location. Our database has 2 billion people in it that we're going to give to the MapReduce process, and it contains their addresses and their connections, who's connected with whom; think Facebook or LinkedIn, for example. One possible MapReduce implementation could look like this (I'll show a little toy code sketch of it in a moment). We want to compute the number of Bob's social friends and arrange those friends by geography. Remember, we had 2 billion records in our database, so we could divide the problem up into subtasks, where each subtask handles 1 million data records. Each Map subtask produces multiple records, each with a country and a count of the connections to Bob in that country. So, for example, Bob has 25 connections in France, 67 in Germany, 31 in India, et cetera. In the Reduce phase, Hadoop assigns tasks to a certain number of processors to accumulate the final results, and in my cartoon example here, I had the results sorted in ascending order: the fewest connections were in France at 25 and the most were in Germany at 67. That's one example of what Map and Reduce can do.

Apache Spark, as I showed you, is more recent. This is a quote from IBM's website, from three years ago. They were commenting on Apache Spark, and they called it the most important new open source project in a decade that is being defined by data. So, it's the next-generation data analytics processing engine, and it can coexist with Hadoop. One way to think of it: it's an extension to MapReduce that adds two primary features beyond what the MapReduce process offers. The first is interactive queries, in lieu of data explorations that could take hours using MapReduce. In our example we had 2 billion data entries and we gave a million entries to each task, and that could take a while to process.
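Before going further with Spark, here's the toy sketch I promised of the Map and Reduce steps from the Bob example above. The names and records are made up, and a real Hadoop job would distribute the chunks across worker nodes rather than iterate over a Python list; this just shows the shape of the two steps:

```python
from collections import defaultdict

# Toy records: (person, country, connected_to) -- made-up data standing in
# for the 2-billion-row connections database in the example.
records = [
    ("alice",  "France",  "Bob"),
    ("claire", "Germany", "Bob"),
    ("deepak", "India",   "Bob"),
    ("erik",   "Germany", "Bob"),
    ("fatima", "France",  "Carol"),  # not connected to Bob, so ignored
]

def map_task(chunk):
    """Map step: emit (country, 1) for every record connected to Bob."""
    return [(country, 1) for _, country, friend in chunk if friend == "Bob"]

def reduce_task(mapped_pairs):
    """Reduce step: sum the per-country counts emitted by the Map tasks."""
    totals = defaultdict(int)
    for country, count in mapped_pairs:
        totals[country] += count
    return dict(totals)

# Pretend the data was split across two worker nodes.
chunks = [records[:3], records[3:]]
mapped = [pair for chunk in chunks for pair in map_task(chunk)]
print(sorted(reduce_task(mapped).items(), key=lambda kv: kv[1]))
# [('France', 1), ('India', 1), ('Germany', 2)]
```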
So, Spark allows these interactive queries to run much more quickly. Also, MapReduce worked on a fixed, static database: like in the Oracle example, we extracted the data, transformed it, loaded it and put it someplace, and then we could set MapReduce on it to try to extract the information we were interested in from that dataset. So, the second capability is that Spark can process streaming data in real time, as things are happening. As information is flowing from a power plant, in my power plant example, up to the Cloud, we can be looking at it as a stream of data. Spark's main data abstraction is based on what's called a Resilient Distributed Dataset, or RDD; that's a parallelized collection of data, and when I think about it, I think of a matrix or a multidimensional matrix. As for Spark's main components: Spark Core consists of the primary APIs that the programmer calls, and it handles task scheduling and all the memory management. Spark Streaming enables an application to process streaming data, which was the enhancement over the MapReduce process. Spark SQL enables queries over structured data. Spark MLlib is the machine learning library; it contains a number of methods, or API calls, for dimension reduction and, it wouldn't surprise me, for analyzing dimensionality, and there are things like k-means and other methods in there. There's a graph library, GraphX, which I took a very quick look at, and it looks interesting for viewing multidimensional data sets. Anyway, it all looks very cool to me, so at some point I plan to go play with it.
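For what it's worth, here's a minimal PySpark sketch of a couple of those pieces, assuming PySpark is installed (pip install pyspark) and using made-up data of my own, not anything from the lecture. It builds an RDD with parallelize, redoes the count-by-country idea with reduceByKey, and runs the same aggregation through Spark SQL:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "class-demo" is just an arbitrary app name.
spark = SparkSession.builder.appName("class-demo").getOrCreate()
sc = spark.sparkContext

# An RDD -- Spark's parallelized collection of data (toy records).
connections = sc.parallelize([("France", 25), ("Germany", 67), ("India", 31),
                              ("France", 3), ("Germany", 5)])

# The same count-by-country idea as the MapReduce example, as a Spark transformation.
totals = connections.reduceByKey(lambda a, b: a + b).collect()
print(sorted(totals, key=lambda kv: kv[1]))

# Spark SQL over structured data: the same records as a DataFrame.
df = spark.createDataFrame(connections, ["country", "connections"])
df.createOrReplaceTempView("connections")
spark.sql("SELECT country, SUM(connections) AS total FROM connections "
          "GROUP BY country ORDER BY total").show()

spark.stop()
```

The streaming, MLlib, and GraphX pieces have their own APIs on top of this same session, which is part of why Spark is described as one engine with several components.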