Learn statistics fast before entering Data Science

This is an overview of statistics topics that you will need before starting your career

Erick Duran
Analytics Vidhya



Perhaps you are interested in starting a career in Data Science but don’t know where to begin before learning how to program. The main goal of Data Science is to use tools to analyze data, gain insights, and make predictions. For this, you need a background in statistics. But what if you don’t have time for a long statistics course and just need the basic knowledge? This is a fast overview of the important statistics topics that will save you time. It focuses on probability, distributions, and regression, but doesn’t include the calculations.

Probability and distributions

Probability is the first topic you should learn in statistics; it’s the base of everything. Probability is the chance that an uncertain event will occur, and it is always measured on a scale between 0 and 1. You should learn some important terms: random experiment, basic outcome, sample space, and event. A random experiment is a process leading to an uncertain outcome. A basic outcome is a possible outcome of a random experiment. A sample space is the collection of all possible outcomes of a random experiment. An event is any subset of basic outcomes from the sample space. If A is an event in the sample space S, then P(A) must lie between 0 and 1, and the probabilities of all basic outcomes in S must sum to 1.

Two basic rules of probability are the complement rule and the addition rule. The complement rule states that P(not A) = 1 − P(A). The addition rule gives the probability of the union of two events (A or B): P(A or B) = P(A) + P(B) − P(A and B).
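You can check both rules empirically with a quick simulation. This is a sketch in Python using a simulated die roll; the events A and B are made up for illustration.

```python
import random

random.seed(0)
trials = 100_000
rolls = [random.randint(1, 6) for _ in range(trials)]

# Event A: the roll is even; event B: the roll is greater than 3
p_a = sum(r % 2 == 0 for r in rolls) / trials
p_b = sum(r > 3 for r in rolls) / trials
p_a_and_b = sum(r % 2 == 0 and r > 3 for r in rolls) / trials
p_a_or_b = sum(r % 2 == 0 or r > 3 for r in rolls) / trials

# Complement rule: P(not A) = 1 - P(A)
p_not_a = sum(r % 2 != 0 for r in rolls) / trials
print(abs(p_not_a - (1 - p_a)) < 1e-9)                  # True

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(abs(p_a_or_b - (p_a + p_b - p_a_and_b)) < 1e-9)   # True
```

Both identities hold exactly on the empirical frequencies, whatever the simulated outcomes are.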

What if one event depends on another? This is called conditional probability: the probability of an event given that another event has occurred, calculated as P(A | B) = P(A and B) / P(B). Two events are statistically independent if and only if P(A and B) = P(A)P(B); in other words, A and B are independent when the probability of one event is not affected by the occurrence of the other.
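A small simulation makes the independence idea concrete. Here the two events are made up: two independent fair coin tosses, so conditioning on B should leave the probability of A unchanged.

```python
import random

random.seed(1)
trials = 200_000
# Toss two fair coins; A = first coin is heads, B = second coin is heads
outcomes = [(random.random() < 0.5, random.random() < 0.5) for _ in range(trials)]

p_a = sum(a for a, b in outcomes) / trials
p_b = sum(b for a, b in outcomes) / trials
p_a_and_b = sum(a and b for a, b in outcomes) / trials

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

# Because the tosses are independent, P(A | B) is close to P(A) (both near 0.5)
print(round(p_a, 2), round(p_a_given_b, 2))
```

If A and B were dependent (say, B = "first coin is heads"), P(A | B) would differ sharply from P(A).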

Probability distributions come in two types: discrete and continuous. Common continuous distributions include the uniform, normal, and exponential distributions. A random variable represents a possible numerical value from a random experiment; a continuous random variable can take any value in an interval. The normal distribution is a bell-shaped function that describes the probabilities of a continuous variable and is the most common type of distribution. Apart from being bell-shaped, it’s symmetrical, and its mean, median, and mode are equal. By varying the mean and standard deviation we obtain different normal distributions. The shaded areas under the curve are the probabilities between ranges, and a probability between two values is given by integrating the probability density function. It’s also important to understand that in statistics we can sample data instead of using the entire population: sampling is less time-consuming, less costly, and can still produce statistical results of high precision. The central limit theorem says that when the sample size is large, the distribution of sample means approaches a normal distribution, even when the underlying data is not normal.

The next step is to analyze the probabilities in our data; for this we formulate a hypothesis about the results we expect. But before moving on to hypothesis testing, we have to understand the confidence interval. A confidence interval quantifies the uncertainty associated with a point estimate of a population parameter, and it also provides additional information about variability.
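As a sketch, here is a 95% confidence interval for a mean computed from a small made-up sample, using the t critical value 2.262 for n − 1 = 9 degrees of freedom.

```python
import math
import statistics

# Made-up measurements (e.g., repeated weighings of the same object)
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]

n = len(data)
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)   # standard error of the mean

# 95% CI: point estimate +/- t_critical * standard error (df = 9 -> 2.262)
t_crit = 2.262
lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

A wider interval means more uncertainty about the population mean; collecting more data shrinks the standard error and narrows the interval.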

Hypothesis testing is one of the most important topics in statistics because it lets you test your results. In hypothesis testing you have a null hypothesis and an alternative hypothesis. The null hypothesis states the assumption to be tested and is always about a population parameter. The alternative hypothesis challenges the null hypothesis and is generally the hypothesis the researcher is trying to support. We choose a level of significance, which marks how unlikely a sample statistic must be, when the null hypothesis is true, before we treat it as evidence against the null; this value is chosen by the researcher at the beginning and determines the rejection region of the sampling distribution. We then use the p-value, the probability of obtaining a test statistic at least as extreme as the observed one, assuming the null hypothesis is true. If the p-value is lower than the level of significance we reject the null hypothesis; if it’s greater, we fail to reject the null hypothesis.
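A minimal sketch of a one-sample t-test, with made-up data: we test the null hypothesis that the population mean is 12.0 by comparing the t statistic to the two-tailed critical value at the 5% significance level (2.262 for 9 degrees of freedom).

```python
import math
import statistics

# H0: population mean = 12.0   H1: population mean != 12.0
sample = [12.9, 12.4, 12.7, 12.1, 13.0, 12.6, 12.8, 12.3, 12.5, 12.7]

n = len(sample)
se = statistics.stdev(sample) / math.sqrt(n)
t_stat = (statistics.mean(sample) - 12.0) / se

# Decision rule at alpha = 0.05, df = 9: reject H0 if |t| > 2.262
decision = "reject H0" if abs(t_stat) > 2.262 else "fail to reject H0"
print(round(t_stat, 1), decision)  # the sample mean (12.6) is far from 12.0
```

Statistical packages report the p-value directly instead of making you look up critical values, but the logic is the same.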

Does a categorical variable conform to a hypothesized distribution? To answer this, we check whether the observed results are consistent with the results expected under the null hypothesis; this is the chi-squared goodness-of-fit test. The test statistic is the sum, over all categories i, of the squared difference between the observed frequency and the expected frequency for category i, divided by the expected frequency. We reject the null hypothesis if the result is greater than the critical value. We can also create contingency tables to classify sample observations according to a pair of categorical variables; to analyze a contingency table, we perform a test of independence where the decision rule at a chosen significance level is based on the chi-squared distribution.
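Here is a sketch of the goodness-of-fit statistic for a made-up example: testing whether a die is fair from 120 simulated rolls. The critical value 11.07 is the chi-squared cutoff at the 5% level with 6 − 1 = 5 degrees of freedom.

```python
# H0: the die is fair, so each face is expected 120/6 = 20 times
observed = [18, 22, 16, 25, 21, 18]   # made-up counts from 120 rolls
expected = [120 / 6] * 6

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Decision rule at alpha = 0.05, df = 5: reject H0 if chi2 > 11.07
decision = "reject H0" if chi2 > 11.07 else "fail to reject H0"
print(round(chi2, 2), decision)  # 2.7 fail to reject H0
```

The small statistic says the deviations from 20 per face are well within what chance alone would produce.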

Simple and Multiple Regression

Finally, after learning how probability and distributions work to understand how data is distributed, we can start learning about regression. Regression analysis is typically used to explain how changes in a variable Y are related to a variable X, and to predict the value of Y based on the value taken by X. Y is the dependent variable, the one we wish to explain or predict. X is an independent variable used to explain or predict Y; we also call it the explanatory variable. It’s important to know that regression measures correlation, which is not enough to prove causation.

This is an example of Linear regression

In simple regression we use only one independent variable to explain the variation in the dependent variable. The regression model is given by Yi = b0 + b1Xi, where b0 is the intercept estimator and b1 is the slope estimator. Differential calculus is used to derive the values of the estimators, and in practice they can be computed with Excel or statistical analysis software.
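The least-squares estimators have closed-form solutions, so you can compute them in a few lines. A sketch with made-up data:

```python
import numpy as np

# Made-up data with a roughly linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Least-squares slope: covariance of x and y over variance of x
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# Intercept: the fitted line passes through the point (mean x, mean y)
b0 = y.mean() - b1 * x.mean()

print(round(b0, 2), round(b1, 2))  # intercept near 0, slope near 2
```

These are exactly the values Excel or any statistics package would report for a simple linear fit of y on x.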

The coefficient of determination, also called R-squared, is the portion of the total variation in Y that is explained by the variation in X. For a simple regression, it is equal to the square of the sample correlation between Y and X. We have a perfect linear relationship between X and Y when R-squared equals 1; this means that 100% of the variation in Y is explained by the variation in X. When 0 < R-squared < 1, there is a weaker linear relationship: some, but not all, of the variation in Y is explained by the variation in X. When R-squared equals 0, there is no linear relationship between X and Y.
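For a simple regression, R-squared is just the squared sample correlation, which NumPy can compute directly. The data here is made up to be nearly (but not perfectly) linear:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])   # close to y = 2x, with small noise

# Sample correlation between x and y, then square it to get R-squared
r = np.corrcoef(x, y)[0, 1]
r_squared = r ** 2
print(round(r_squared, 3))  # close to 1: x explains nearly all variation in y
```

If the points fell exactly on a straight line, R-squared would be exactly 1.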

If we want to create a good regression model, simple regression is usually not enough, as there may be other independent variables related to the dependent variable. This is the multiple regression model. The coefficients of the multiple regression model are estimated from sample data and can be obtained using computer software.

The estimated regression equation with k independent variables is: Y = b0 + b1X1 + b2X2 + … + bkXk
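The coefficients can be estimated with any least-squares solver. A sketch with two made-up predictors, where the data is generated from a known relationship so the estimates are easy to check:

```python
import numpy as np

# Made-up predictors and a response generated as Y = 1 + 2*X1 + 0.5*X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = 1.0 + 2.0 * X1 + 0.5 * X2

# Design matrix: a column of ones (for b0) plus one column per predictor
A = np.column_stack([np.ones_like(X1), X1, X2])

# Solve the least-squares problem for [b0, b1, b2]
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(np.round(coef, 2))  # recovers approximately [1, 2, 0.5]
```

With real, noisy data the estimates would only approximate the true coefficients, and the software would also report standard errors and p-values for each one.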

To decide whether to keep the independent variables in the regression model, we should look at the p-values reported by the software. If the p-value of an independent variable is lower than the significance level, we reject the null hypothesis, meaning that the independent variable is significant in our regression model.

If we want to use categorical variables in our regression model, we can encode them as numerical values. An example is the use of dummy variables, where the variable has two levels, 0 and 1. When using dummy variables we should avoid multicollinearity problems: if we create one dummy column per level, the columns sum to 1 for every observation, duplicating the intercept, and the estimation will fail. When this happens we should drop one of the redundant variables.
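A minimal sketch of the encoding, with a made-up two-level variable. Keeping only one of the two possible dummy columns avoids the redundancy described above (often called the dummy variable trap):

```python
# A two-level categorical variable from a made-up dataset
colors = ["red", "blue", "blue", "red", "red"]

# Keep a single 0/1 dummy ("is_red") and drop the redundant "is_blue" column,
# since is_red + is_blue would always equal 1, duplicating the intercept
is_red = [1 if c == "red" else 0 for c in colors]
print(is_red)  # [1, 0, 0, 1, 1]
```

Libraries such as pandas offer the same idea as a built-in option that drops the first level automatically.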


All of this information might be overwhelming, but it’s really important to understand at least the basics in order to understand your data. This post is not meant to go deep into statistics; there is a lot of additional material, such as the calculations and other details behind each topic. Plenty of resources are available online if you want to go deeper, and hopefully this post has given you a solid overview of what you should know in order to start your career.

Data science is a complex career and there are lots of topics you are going to learn in order to improve and master your skills. Just be patient and stay motivated to keep learning. Statistics gives you a better understanding of your data, and that knowledge adds value to your work.


