Descriptive statistics helps to simplify large amounts of data in a sensible way. In descriptive statistics we do not draw conclusions beyond the data we are analyzing, nor do we reach any conclusions regarding hypotheses we may make. We aim to present quantitative descriptions of data in a manageable form. Statistics is based on two main concepts. The first is the population: a collection of objects about which we seek information. The second is a sample: the observed part of the population. Descriptive statistics applies the concepts, measures, and terms that are used to describe the basic features of the samples in a study. One of the first tasks when analyzing data is to collect and prepare the data in a format appropriate for the analysis of the samples. This includes four steps. The first is obtaining the data. Data can be read directly from a file or might be obtained by scraping the web. The second step is parsing the data. The parsing procedure depends on what format the data are in: plain text, fixed columns, CSV, XML, HTML, and so on. The next step is cleaning the data. Data files are often incomplete, and they almost always contain errors. A simple strategy is to remove or ignore incomplete or erroneous records. Finally, the fourth step is to build the data structures. It is necessary to store the data in a data structure that lends itself to the analysis we are interested in. Exploratory data analysis is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods. The objectives of exploratory data analysis are the following: to suggest hypotheses about the causes of the observed phenomena, to assess assumptions on which statistical inference will be based, to support the selection of appropriate statistical tools and techniques, and to provide a basis for further data collection through surveys or experiments.
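The four preparation steps can be sketched with Pandas. This is a minimal illustration, not the lecture's own code: the CSV content, column names, and file layout are hypothetical stand-ins for a real downloaded data set.

```python
import pandas as pd
from io import StringIO

# Hypothetical raw data standing in for a downloaded CSV file;
# one record is incomplete and will be dropped during cleaning.
raw = StringIO(
    "date,country,new_cases\n"
    "2020-03-01,Belgium,12\n"
    "2020-03-01,Sweden,\n"       # incomplete record (missing value)
    "2020-03-02,Belgium,30\n"
)

# Steps 1-2: obtain and parse the data (here the format is CSV).
df = pd.read_csv(raw, parse_dates=["date"])

# Step 3: clean the data -- the simple strategy of dropping
# incomplete records.
df = df.dropna(subset=["new_cases"])

# Step 4: build a data structure suited to the analysis,
# e.g. one column per country indexed by date.
table = df.pivot(index="date", columns="country", values="new_cases")
print(len(df))  # 2 complete records remain
```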
One of the main goals of exploratory data analysis is to visualize and summarize the sample distribution. This allows us to make tentative assumptions about the population distribution. The observed data represent just a finite set of samples out of an infinite number of possible samples. The characteristics of randomly observed samples are interesting only to the degree that they represent the population they came from. Given a quantitative variable, exploratory data analysis is a way to make preliminary assessments about the population distribution of the variable using the data of the observed samples. The characteristics of the population distribution of a quantitative variable are its mean, deviation, histograms, outliers, and others. For a given sample of n values, the mean is defined as the sum of the values divided by their number n; see the formula on the slide. The mean can be computed with Pandas by using the method mean. Look at the code of the example, which computes the average number of non-zero new Covid-19 cases in Belgium and Sweden in terms of their mean. Note that zero values are excluded. The result is shown at the bottom. The variance describes the spread of the data. It is defined by the formula presented on this slide: it is the mean of the squared deviations from the mean. The square root of the variance is called the standard deviation. Here is an example of code which calculates the standard deviation. The mean of the samples has an important drawback: it is very sensitive to errors. If the sample set contains a value that is very different from the rest of the set, we call it an outlier, and the mean will be drastically pulled towards this outlier. One solution to this drawback is offered by the statistical median, which is an order statistic giving the middle value of a sample. To find the median, we sort the values and take the value in the middle of the ordered list as the result.
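Since the slide code itself is not reproduced in this transcript, here is a small sketch of the same idea. The case counts are hypothetical stand-ins for the Belgium/Sweden data; the zero-exclusion mirrors the lecture's example.

```python
import pandas as pd

# Hypothetical daily new-case counts standing in for the
# Belgium/Sweden data used in the lecture.
cases = pd.DataFrame({
    "Belgium": [0, 10, 20, 30, 0, 40],
    "Sweden":  [5, 0, 15, 25, 35, 0],
})

# Exclude zero values (they become NaN and are skipped), then
# compute the mean, standard deviation, and median per country.
nonzero = cases[cases > 0]
print(nonzero.mean())    # mean of the non-zero values
print(nonzero.std())     # standard deviation
print(nonzero.median())  # median, robust against outliers
```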
This value is much more robust in the face of outliers. We can compute the median value of new cases for Belgium and Sweden with the code shown here. We observe that the two medians are quite different. Sometimes we are interested in observing how the sample data are distributed in general. In this case, we can order the samples and then find a value x_p that divides the data into two parts, where a fraction p of the data is less than or equal to x_p and the remaining fraction, 1 minus p, is greater than x_p. The value x_p is the p-th quantile; expressed on a scale from 0 to 100, it is called a percentile. An important thing to do is to validate the data by inspecting it. Summarizing data by just looking at the mean, the median, and the variance can be unreliable, because very different data sets can be described by the same statistics. We can instead look at the data distribution, which describes how often each value appears. The most common representation of a distribution is the histogram, a graph that shows the frequency of each value. For example, the numbers of new cases in Belgium and Sweden can be plotted as histograms with the code presented on this slide. Here are the histograms for our example. To compare histograms, we can plot them together, overlapping in the same picture, using the code shown here. It is worth mentioning that the parameter alpha is responsible for the transparency of the plotted data. The data can also be normalized. We normalize the frequencies of the histogram by dividing them by the number of samples. The normalized histogram is called the probability mass function. In our previous example, in order to normalize the distribution, we simply change the named parameter density from False to True. Now we can compare empirical distributions obtained under different conditions. The cumulative distribution function, or just distribution function, describes the probability that a real-valued random variable X will be found to have a value less than or equal to x.
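A sketch of the quantile computation and the overlapping normalized histograms described above. Again the data are hypothetical, and the output file name is arbitrary; alpha and density are the parameters mentioned in the lecture.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; plots go to a file
import matplotlib.pyplot as plt

# Hypothetical new-case counts standing in for the lecture's data.
cases = pd.DataFrame({
    "Belgium": [10, 20, 30, 40, 50, 60, 70, 80],
    "Sweden":  [5, 15, 25, 35, 45, 55, 65, 75],
})

# The p-th quantile: the value below which a fraction p of the
# data lies; p = 0.5 gives the median.
print(cases["Belgium"].quantile(0.5))

# Overlapping histograms: alpha controls transparency, and
# density=True normalizes the frequencies (probability mass).
plt.hist(cases["Belgium"], bins=4, alpha=0.5, density=True, label="Belgium")
plt.hist(cases["Sweden"], bins=4, alpha=0.5, density=True, label="Sweden")
plt.legend()
plt.savefig("hist.png")
```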
With the named parameter cumulative set to True, we get the distribution functions for the new-cases data in Belgium and Sweden. This is how cumulative distribution functions typically look. As we mentioned before, outliers are data samples whose value is far from the central tendency. To detect outliers, one can use different rules: for example, find samples that are far from the median, or find samples whose deviation from the mean is two or three times greater than the standard deviation. As outliers may spoil the characteristics of the population distribution, it is important to analyze them very carefully. It may be necessary to characterize the asymmetry of the distribution. One of the statistical parameters that characterizes the asymmetry of a set of data samples is the skewness. The formula for the skewness is presented on this slide. Negative skewness indicates that the distribution skews left: it extends further from the mean to the left than to the right. Evidently, for the normal distribution, as well as for any other symmetric distribution, the skewness is equal to zero. Although the Pandas library does provide a skew method, skewness can also easily be computed directly from the formula; here is an example of the corresponding code. The distributions we have considered up to now are based on empirical observations, which are discrete. As an alternative, we may be interested in distributions that are defined by a continuous function and are called continuous distributions. In this case, instead of a probability mass function, we speak of a probability density function, defined by the formula presented on this slide. There are many continuous distributions; we will consider only the most common ones, the exponential and the normal distribution. Exponential distributions describe the inter-arrival time between events.
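The outlier rule and the direct skewness computation can be sketched as follows. The sample values are hypothetical, and the two-standard-deviations threshold is just one of the rules mentioned above.

```python
import pandas as pd

# Hypothetical sample with one obvious outlier.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

# Outlier rule from the lecture: flag samples whose deviation
# from the mean exceeds twice the standard deviation.
outliers = s[(s - s.mean()).abs() > 2 * s.std()]
print(outliers)

# Skewness computed directly from the formula: the mean cubed
# deviation divided by the cube of the standard deviation.
def skewness(x):
    return ((x - x.mean()) ** 3).mean() / x.std(ddof=0) ** 3

print(skewness(s))  # positive: the distribution has a long right tail
```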
When events are equally likely to occur at any time, the distribution of the inter-arrival time tends to an exponential distribution. Here are some examples: the time until a radioactive particle decays, the time until your next telephone call, the time until the default (on payment to company debt holders) in reduced-form credit risk modeling. The cumulative distribution function and the probability density function of the exponential distribution are defined by the equations shown here. The parameter lambda defines the shape. The mean of the distribution is 1 over lambda, the variance is 1 over lambda squared, and the median is the natural logarithm of 2 divided by lambda. The normal distribution, or Gaussian distribution, represents many real phenomena: economic, natural, social, practically any kind. Here we list some examples. The cumulative distribution function of the normal distribution has no closed-form expression. The probability density function is given by the formula presented on this slide. The parameter sigma defines the shape of the distribution.
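Since the slide equations are not reproduced in the transcript, here is a sketch of the two density functions written out from the standard formulas, together with the exponential distribution's summary statistics quoted above.

```python
import numpy as np

# Probability density function of the exponential distribution:
# f(x) = lambda * exp(-lambda * x) for x >= 0.
def exp_pdf(x, lam):
    return lam * np.exp(-lam * x)

# Probability density function of the normal distribution:
# f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)).
def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

lam = 2.0
# Mean 1/lambda, variance 1/lambda^2, median ln(2)/lambda.
print(1 / lam, 1 / lam ** 2, np.log(2) / lam)
```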