[MUSIC] Hello, my name is Eddie Valaitis and I am a Director in PwC's Advisory Pharma and Life Sciences Analytics Practice. Before joining PwC, I was a tenure-track professor of statistics at American University. Now I focus on pharma research and development, applying my theoretical statistics and machine learning knowledge to a diverse set of complex issues facing our biopharma clients. These issues range from wrangling and standardizing large datasets, to accelerating drug development by identifying those subpopulations more responsive to particular treatments, to assessing toxicity across specific drug platforms, to developing and implementing enterprise-wide clinical trial modernization solutions.
In this video, I'm going to talk to you about R and RStudio. I'll describe what the tools are, what they are used for, and I'll share a client example where R was used to solve a business problem.
Why do we like R? R is an open-source software environment and language for statistical computing and graphics. Open-source means that the language is freely available and may be distributed and modified by anyone, so R has a very active contributor and user base. R is the leading statistical analysis package, as it allows the import of data from multiple sources and in multiple formats: for example, flat files, SAS files and direct connections to graph databases. In addition to data management capabilities, R contains over 7,000 specialist packages that are all free. These packages allow users to employ cutting-edge statistics, econometrics, optimization, machine learning and simulation techniques. This all makes R the leading analytics language in academia and industry.
There's also a set of integrated tools designed to help you be more productive with R. It's a powerful and user-friendly graphical user interface called RStudio.
RStudio includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. More specifically, RStudio includes integrated R help and documentation, the ability to easily manage multiple working directories using projects, a workspace browser and data viewer, and an interactive debugger to diagnose and fix errors quickly. RStudio allows users to annotate code more fully, debug code more easily, view data in spreadsheet format and see all objects defined in a session. What's more, RStudio allows users to employ packages for automated Word or PDF report generation, and it also contains a powerful visualization plugin called Shiny. Shiny allows development of interactive and flexible visualizations and dashboards that can be accessed by non-programmers via internet browsers.
As mentioned earlier, R is a powerful and flexible analytics tool that can be used to implement various data management, statistical analysis and machine learning tasks. We could spend years going over all of the methods implemented in R, but I will highlight some of the most commonly used R features.
For data management, R is used for importing data, assessing data quality by identifying outliers, implausible information and missing observations, and merging as well as stacking data. In summary statistics, R helps create routine summary statistics describing the distribution of a variable: for example, means, medians, standard deviations and skewness. For statistical tests, R produces means tests, proportion tests, association tests, distributional tests and tests of stationarity. In regression models, R is used for ordinary regressions, logistic regressions, stepwise regressions, ridge regressions, analysis of variance, multivariate adaptive regression splines, support vector machines and nonlinear regressions.
For clustering, R is used for k-means, partitioning around medoids and hierarchical clustering. In dimensionality reduction, R works with principal components and multidimensional scaling. And in decision trees and random forests, R supports classification, regression and survival trees and random forests.
The flexibility of R makes it the top choice for executing complex data management and statistical analysis client engagements. For example, a leading biopharma organization sought our help in aggregating and standardizing data across a drug platform. Our goal was to identify tumor gene expression cut points associated with differentiated survival outcomes, and to develop an interactive visualization solution for cut point identification and validation. R was a natural choice for this engagement, as it allowed us to quickly import, assess, standardize and aggregate data that were messy and incongruous. For example, the data had missing variable labels, missing values and multiple units of measurement for the same variable across trials.
We then wanted to identify important genes associated with survival, after accounting for confounding clinical and demographic covariates, using the random forest technique for survival objects. We then identified gene expression cut points using survival trees. We followed that up with Cox proportional hazards and a number of more advanced parametric survival models to assess whether the identified cut points were associated with differentiated survival outcomes. To bring all of the analysis to life, we used an interactive Shiny visualization that allowed the client to independently identify genes and their cut points without having to program in R.
As you can see in our examples, R is an important tool for data management, analytics, statistical modeling and machine learning. So, we've covered a lot of material in this segment. If some of the terms were not familiar to you, be sure to reference the course glossary. [MUSIC]
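As a small illustration of the data-quality steps mentioned in the client example (missing values and mixed units for the same variable across trials, followed by routine summary statistics), here is a minimal sketch. The engagement itself was done in R; this sketch uses Python purely as a stand-in, and every field name, trial label and value in it is hypothetical rather than taken from the engagement.

```python
from statistics import mean, median, stdev

# Hypothetical records pooled from two trials. Field names and values are
# invented for illustration only.
records = [
    {"trial": "A", "weight": 70.0, "unit": "kg"},
    {"trial": "A", "weight": None, "unit": "kg"},   # missing observation
    {"trial": "B", "weight": 154.0, "unit": "lb"},  # same variable, other unit
    {"trial": "B", "weight": 176.0, "unit": "lb"},
]

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms


def standardize_weight(record):
    """Return the weight in kilograms, or None if the value is missing."""
    value = record["weight"]
    if value is None:
        return None
    if record["unit"] == "lb":
        return value * LB_TO_KG
    if record["unit"] == "kg":
        return value
    # Fail loudly on units we have not planned for instead of guessing.
    raise ValueError(f"unexpected unit: {record['unit']!r}")


# Standardize units, then separate missing from observed values.
weights_kg = [standardize_weight(r) for r in records]
n_missing = sum(w is None for w in weights_kg)
observed = [w for w in weights_kg if w is not None]

# Routine summary statistics on the standardized, non-missing values.
summary = {
    "n_missing": n_missing,
    "mean": mean(observed),
    "median": median(observed),
    "sd": stdev(observed),
}
print(summary)
```

Raising an error on an unrecognized unit, rather than passing the value through, is a deliberate choice in this sketch: with messy multi-trial data, silent pass-through would mix scales in the summary statistics.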