Monthly Archives: September 2017

Normal Distribution Summary

This post gives the most important points to understand for the normal distribution.  If you want to see the rest of my content for statistics, please go to this table of contents.

 

Normal Curve Basic Information

  • The standard normal curve is a probability distribution, which means the total area under the curve is 1.0

Probability Distribution of Normal Curve

  • Like all probability density functions (PDFs), it can be accumulated, i.e. summed as you go, to produce a cumulative distribution function (CDF) (a short numeric sketch follows this list)

Normal Curve Cumulative Probability

  • You more frequently see the normal curve plotted as a probability density function (i.e. the bell curve). But most of the time when you actually use it, such as looking up the probability of something being more than 2 standard deviations away from the mean in a Z table, you are actually using the cumulative distribution function.
    • To put it a different way, we more frequently care about the area under some section of the curve (the CDF at one location minus the CDF at another location) than the density at any single point on the curve.
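
As a quick illustration of that point (my own sketch, not from the original post; it assumes Python with numpy and scipy available): summing up the PDF numerically reproduces the CDF, and the total area comes out to 1.0.

```python
import numpy as np
from scipy.stats import norm

# "Sum as you go": accumulate the PDF numerically and compare to the exact CDF
xs = np.linspace(-4, 4, 8001)
step = xs[1] - xs[0]
cdf_numeric = np.cumsum(norm.pdf(xs)) * step   # running sum * step size ~ integral

print(cdf_numeric[-1])             # ~1.0, the total area under the curve
print(cdf_numeric[len(xs) // 2])   # ~0.5, the accumulated area up to the mean (x = 0)
print(norm.cdf(0.0))               # 0.5 exactly, for comparison
```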

 

The Real Life Meaning Of The Normal Curve

  • A normal curve has a physical meaning in real life
  • If you take multiple measurements of different samples, some results will end up high, some will end up low, and most will fall in the middle. The shape that results will likely be the normal curve
    • The classic example of this is if you measure the height of a group of plants, i.e. how tall are different stalks of corn
  • The normal curve also has an easy to duplicate mathematical meaning
  • If you take a large number of independent trials of an event with 50% probability, the resulting shape will approximate a normal curve
    • I.e. if I flip a coin 20 times and count the number of heads, and I repeat that test many times, the resulting histogram will be similar to a normal distribution
    • The chart below shows the binomial distribution of 20 trials with a 50% likelihood repeated, vs the normal distribution using the same mean and standard deviation.

binomial distribution vs normal distribution

  • The more times I repeat that test, the more similar the binomial distribution becomes to the normal distribution (a simulation sketch follows this list)
  • This is the same reason the normal distribution exists in real life
    • For instance, if we assume that a plant’s height is determined by 20 genes, and the genes the plant receives could be either tall genes or short genes, then the number of tall genes the plant gets is basically equivalent to asking how many heads you will get in 20 flips. It will follow the same distribution, which approaches the normal curve as the number of trials increases.
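
Here is a minimal simulation sketch of that coin flip test (my own illustration; the number of tests and the histogram scaling are arbitrary choices):

```python
import random
from collections import Counter

# Flip a fair coin 20 times, count the heads, and repeat that test many times
num_tests = 10_000
counts = Counter(
    sum(random.random() < 0.5 for _ in range(20))  # heads in one 20-flip test
    for _ in range(num_tests)
)

# The histogram of head counts approximates a normal curve centered at 10
for heads in range(21):
    print(f"{heads:2d} {'#' * (counts.get(heads, 0) // 50)}")
```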

 

Using A Normal Distribution

  • The normal distribution uses standard deviations as a way of fitting itself to any set of data. The standard normal distribution assumes that the mean of the data set is 0 and the standard deviation is 1.0.  You can either stretch / shrink the standard normal distribution to match your data, by multiplying it by the standard deviation and shifting it by the mean, or you can transform your data to match the standard normal distribution by subtracting the mean and dividing by the standard deviation.
  • What we usually care about with a normal distribution is the area under the curve up to a certain number of standard deviations
    • For instance, if we want to know what percentage of the total area falls between -2 and +2 standard deviations, we would take this area

how to use a normal distribution

And we would see that 95.4% of the data falls within 2 standard deviations of the mean

  • If we want to know what percentage is to the left of -1 standard deviation, we would take this area

left side of the normal distribution

And we would see that 15.9% of the data falls more than 1 standard deviation below the mean.

  • Frequently this is done using a normal table, also known as a Z-table
    • There are lots of good references on how to use a Z table, including this YouTube video.
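
As a sketch of those same lookups in code (my own addition, assuming Python with scipy; the numbers mirror the shaded areas above):

```python
from scipy.stats import norm

# Area between -2 and +2 standard deviations: CDF at one location minus the other
print(norm.cdf(2) - norm.cdf(-2))   # ~0.954, i.e. 95.4%

# Area to the left of -1 standard deviation
print(norm.cdf(-1))                 # ~0.159, i.e. 15.9%

# The Z-table workflow: standardize a raw value first, then look up the area.
# The mean and standard deviation here are hypothetical.
mu, sigma = 65, 5
x = 55
z = (x - mu) / sigma                # z = -2.0
print(norm.cdf(z))                  # ~0.023, the area to the left of -2 sd
```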

 

 

Level 2 Information – I recommend you get familiar with the basics of some of the other statistics topics before coming back and revisiting this in greater depth.

  • Recall that the normal distribution can be generated as a series of binary events done an infinite number of times. The math of understanding binary events is the binomial distribution.  One way to approximate the binomial distribution is a normal curve. This is because the binomial distribution is just the normal curve with a finite number of events instead of an infinite number of events.
    • A rule of thumb for when the normal curve is a good approximation of the binomial distribution: the number of trials multiplied by the smaller of the probability of success or failure (p or (1-p)) should be at least 10. I.e. if I have a 50% chance of success and do 20 trials, then 0.5 * 20 = 10, so this would be a good approximation
    • If I had an 80% chance of success in a single trial, and do 12 trials, then I take the minimum of the probability of success (80%) or failure (20%). This gives 0.2 * 12 = 2.4, so the normal curve would not be a good approximation of this binomial distribution (a small check function follows the chart below)

binomial distribution vs normal distribution - not a match
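
A minimal sketch of that rule of thumb as a check function (my own illustration; the threshold of 10 is the one quoted above, though some texts use a lower value):

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb: the normal curve is a good approximation of Binomial(n, p)
    when both n*p and n*(1-p) are at least the threshold."""
    return min(n * p, n * (1 - p)) >= threshold

print(normal_approx_ok(20, 0.5))   # True:  0.5 * 20 = 10
print(normal_approx_ok(12, 0.8))   # False: 0.2 * 12 = 2.4, as in the chart above
```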

  • The normal curve is a key component in testing for statistical significance, also known as hypothesis testing

 

 

Level 3 Information

  • Not everything is a normal distribution. Some real-life events have infrequent outcomes that are more likely to occur than a normal distribution would predict.  One example of this is the financial markets. Occasional crashes happen with more severity than would be predicted by a normal distribution
  • There are different ways of measuring a probability distribution to see how similar it is to the normal distribution
    • Kurtosis is a measure of how much probability is at the center of the distribution vs at the tails
    • A normal distribution has a kurtosis of 3.0; something with fat tails would have a kurtosis greater than 3.0. This Cross Validated answer has a good visualization of kurtosis
  • Skew is a way of measuring whether the distribution is symmetric or not. A normal distribution is exactly symmetric.  However, a probability distribution can also be skewed left or skewed right.
    • Skewed left means that the left tail is a lot longer than the right tail. This is a skewed left distribution

skewed left normal distribution

  • Skewed right (also known as positive skew, because there is more area to the right of the mean) means that the right tail has a lot more area than the left tail. This is a skewed right distribution

skewed right normal distribution

  • The normal distribution also has an equation, which is

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

  • Where µ is the mean of the data
  • And σ is the standard deviation
  • This is the probability density function of the normal curve. Plugging a mean, a standard deviation, and the value you are looking for into this equation is the equivalent of using the NORM.DIST() function in Excel (with the cumulative argument set to FALSE)
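
As a sketch (my own addition; the mean and standard deviation in the example call are made up), the equation translates directly into code, and scipy or Excel's NORM.DIST(x, mean, standard_dev, FALSE) give the same number:

```python
import math
from scipy.stats import norm  # only used to cross-check the hand-rolled version

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal curve at x, i.e. the equation above."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

print(normal_pdf(1.0))                   # ~0.2420, standard normal density at x = 1
print(norm.pdf(1.0))                     # same number from scipy
print(normal_pdf(70.0, mu=65, sigma=5))  # density 1 sd above a hypothetical mean of 65
```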

 

Next Topic

You may want to look at Hypothesis Testing next, which heavily uses the normal distribution

Statistical Significance Summary

This post gives the most important points to understand for statistical significance.  If you want to see the rest of my content for statistics, please go to this table of contents.

What Is Hypothesis Testing – In 3 Sentences

Hypothesis testing is a way of determining if some measured effect is a real difference, or likely just statistical noise.  The baseline belief is always that any difference in measurements is just statistical randomness.  We use the hypothesis testing equations to demonstrate that any measured differences are large enough that they are very unlikely to be merely random variations.

 

Hypothesis Testing – As An Image

Hypothesis testing is essentially placing an error band (the bell curve below) around a point that you measured (the orange dot) using a modification of the normal curve, determining where another point would be located on that chart, and seeing how much area is under the modified normal curve up to that location

basic hypothesis testing

The width of the curve can change as you get more data

hypothesis testing with a narrower width

And sometimes you have error bands around both points

two sample t test error bands

When Is Hypothesis Testing Used?

Hypothesis testing is used in scientific studies of all kinds to determine if an effect exists.  This is synonymous with the term “Statistical Significance”.   It is used, for instance, to show the difference between a real medicine and a placebo.  This is also used in things such as A/B tests for advertising to determine which ads are most effective.

Hypothesis Testing In More Detail

  • Hypothesis testing always has two sets of measurements (i.e. measure 10 samples from over here, and 15 samples from over there). Each of those two sets of measurements will have some average value.  So there are always two averages.
  • Those two averages will always have some difference between them.
    • Sometimes that difference is very large, i.e. if I measure the maximum weight lifting ability of a group of people before they start training vs. after they spend a year training
    • Sometimes the difference is very small. Sometimes the difference can be so small it is within the precision limits of the data and shows up as zero.
  • Hypothesis testing is determining if
    • There is some systematic cause which results in the observed difference between the two averages or
    • If it is likely that the observed difference is solely due to statistical noise, i.e. the typical fluctuations in results you get when you take measurements. This is the “Null Hypothesis”
      • Example – I have a coin that I know is a fair coin and will thus come up heads 50% of the time. I flip it 100 times and get 53 heads.  The difference between 53 heads and the expected 50 heads is small enough that it is probably statistical noise rather than “someone gave me a weighted coin”.  Hypothesis testing is a way of putting concrete numbers to the statement “probably statistical noise” (a sketch of this calculation follows this list)
    • Our default assumption is the “Null Hypothesis”. We assume that any difference in the average results is merely statistical noise until we show that to be more unlikely than a certain threshold.
      • We get to decide what we want that threshold to be. A typical value is that there must be less than a 5% chance that the results are merely random noise before we assume that the results are systematic differences.  (Less than 1% chance, and less than 0.1% chance are also common thresholds)
    • Note, even if we determine that there is a systematic difference, our calculations won’t tell us what is causing the difference in the average value between the two sets of data, just that there is at least one systematic difference
      • I.e. if we are confident that certain groups of people are stronger after a year spent lifting weights, that doesn’t tell us if the training actually caused the difference. It could have been something different like secret steroid usage.
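
Here is a sketch of putting concrete numbers on that coin example (my own illustration, assuming Python with scipy):

```python
from scipy.stats import binomtest

# Is 53 heads out of 100 flips consistent with a fair coin?
result = binomtest(k=53, n=100, p=0.5)
print(result.pvalue)  # ~0.62: far above a 5% threshold, so "probably statistical noise"
```

In other words, a fair coin produces a result at least this lopsided most of the time, so there is no reason to reject the null hypothesis.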

 

Hypothesis Testing Equations

  • There are 5 different types of hypothesis tests, each with their own equations. However, don’t get hung up on that yet; there are only small differences between the 5 tests.
  • All hypothesis tests compare two sets of data. Call them the baseline set and the test set.
  • Each of those two sets of data has 3 attributes, for a total of 6 attributes.
    • The first attribute is the average of that set
    • The second is the standard deviation of that set
    • The third is the number of measurements in the data set
  • There are 5 different types of hypothesis tests (1 Z-test, and 4 T-Tests) and the only reason there is more than 1 type of hypothesis test is that they all make different assumptions about the 6 attributes. It isn’t important to know all the different assumptions yet, but here are some examples
    • One of the tests assumes that you don’t know anything about any of the 6 attributes other than what you measured
    • One of the tests assumes that the only thing you know is that both sets have the same standard deviation
    • Another test assumes that you have infinite measurements of the baseline set. For instance, you know the average height of people in a certain state with certainty because you looked it up in the government census results
    • Other tests have different assumptions. It isn’t important to know any of these yet other than to know that the equations are doing the same thing with different assumptions
  • In most cases, all 5 different types of hypothesis tests will give a similar answer. This is a good thing, and it means that if you understand how any of them work, you basically understand all of them
  • This free PDF cheat sheet has the equations for the 5 different types of hypothesis tests, as well as an example of when you would use each one.

 

Hypothesis Testing And The Normal Curve

  • It is pretty important to have some knowledge of the normal curve (i.e. bell curve), as well as a general understanding of what a standard deviation is.   See this blog post for an overview.   (blog post TBD)
  • Remember that we have a baseline set of data, and a test set of data. Each of those sets has an average, a standard deviation, and a number of measurements.
  • What hypothesis testing is all about is “How well do we know the average values of the populations that we took our measurements from?”
    • I.e. even though we have an average value of our measurements, that is just the average of the samples we took, not the full population
    • There will always be some difference between our sample average and the true population average
    • Our average has a range of uncertainty, and we can use the normal curve and standard deviation to quantify that uncertainty
  • This blog post shows how to quantify the uncertainty you have in your average value. The examples given are dice outcomes that you are already familiar with.  (i.e. that the most likely roll using 2 dice is a 7) http://www.fairlynerdy.com/intuitive-guide-statistical-significance/
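
As a small sketch of quantifying that uncertainty (my own addition; the measurements are hypothetical dice-style numbers): the standard error of the mean shrinks as you take more measurements.

```python
import math

measurements = [7, 6, 8, 7, 5, 9, 7, 6]   # hypothetical sample, e.g. sums of two dice

n = len(measurements)
mean = sum(measurements) / n
# Sample standard deviation (n - 1 in the denominator)
std_dev = math.sqrt(sum((x - mean) ** 2 for x in measurements) / (n - 1))
# Standard error: the uncertainty of the sample mean as an estimate of the population mean
std_err = std_dev / math.sqrt(n)

print(f"mean = {mean:.2f}, standard error = {std_err:.2f}")
# Using the normal curve, roughly 95% confidence covers ~2 standard errors either side:
print(f"~95% range: {mean - 2 * std_err:.2f} to {mean + 2 * std_err:.2f}")
```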

 

T-Test Degrees Of Freedom

  • Compared to the Z-test, T-tests have an additional equation, where you calculate the degrees of freedom.
  • Degrees of freedom is just a way of determining how many overall measurements you have in your data set, which is used to determine how accurate your calculated standard deviation likely is.
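
For reference, here are the standard degrees of freedom formulas (my addition; the cheat sheet linked above has the full set of test equations). For a 1-sample t-test with n measurements:

$$df = n - 1$$

For the 2-sample unequal variance t-test, the Welch–Satterthwaite approximation is commonly used, where $s_1, s_2$ are the two sample standard deviations and $n_1, n_2$ the two sample sizes:

$$df \approx \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}}$$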

 

Hypothesis Testing Examples
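
As a worked sketch (hypothetical numbers, assuming Python with scipy): a 2-sample t-test on made-up strength measurements, echoing the weight lifting example above.

```python
from scipy.stats import ttest_ind

# Hypothetical max-lift measurements from two groups of people
baseline = [100, 95, 110, 105, 98, 102, 97, 108, 101, 99]
test_set = [112, 108, 120, 115, 109, 118, 111, 116, 114, 110]

# 2-sample t-test, unequal variance; equal_var=True would pool the variances instead
t_stat, p_value = ttest_ind(test_set, baseline, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
# A p-value below the chosen threshold (e.g. 0.05) means we reject the null
# hypothesis that the difference in averages is just statistical noise
```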

 

 

Level 2

That’s it for the first block of information.  If you work through a couple of examples and understand most of those points you will have a good grasp of hypothesis testing.  I recommend coming back and learning this second section in a few days or a week.

 

T Distribution vs Z Distribution

What is the difference between a T-Distribution and a Z distribution?

  • The point of having any distribution at all is that there is a range of values that your average could be. I.e. having a normal distribution accounts for the fact that you don’t know your average exactly.  But you also don’t know the standard deviation of your data exactly.
    • To be more precise, you know the standard deviation of your measured data. However, you don’t know the exact standard deviation of the population it was drawn from
  • A Z distribution uses a normal curve and ignores any uncertainty in your standard deviation. This is because it assumes you have enough data that your standard deviation is quite accurate.
  • A T distribution takes into account the fact that you have a range of error in your standard deviation. It does this by changing the shape of the distribution
  • It bases the shape of the curve on the degrees of freedom, which is a way of calculating how many measurements you have.
  • With a T-distribution, instead of the typical normal curve, you get a curve with fatter tails
    • This applet lets you play with the shape of a T distribution vs a Z distribution assuming different number of samples http://rpsychologist.com/d3/tdist/
      • To use it, slide the slider above the charts to the right or the left to change the degrees of freedom assumed in the T-Distribution
      • This will change the shape of the T-distribution
      • You will see that with a low number of degrees of freedom, the T distribution has much fatter tails than the normal distribution. However, as the number of degrees of freedom (i.e. the number of measurements, i.e. the confidence you have in your measured standard deviation) increases, the T distribution becomes nearly identical to the standard normal distribution
    • Once you get to around 30 data points or so, the difference between the T Distribution and the Z distribution mostly goes away, which is why 30 degrees of freedom is a common rule of thumb for when to switch to a Z-test instead of a T-test. (Note, there are other important considerations as well, such as whether you have a baseline measurement or are measuring both the baseline and the test sets of data)
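
The same convergence can be checked numerically (a sketch of my own, assuming Python with scipy): the tail area beyond 2 standard deviations shrinks toward the normal value as the degrees of freedom grow.

```python
from scipy.stats import norm, t

# Tail probability beyond 2 standard deviations for increasing degrees of freedom
for df in [2, 5, 10, 30, 100]:
    print(f"df = {df:3d}: P(T > 2) = {t.sf(2, df):.4f}")
print(f"normal:    P(Z > 2) = {norm.sf(2):.4f}")
# By df ~ 30 the t tail is already close to the normal tail,
# matching the rule of thumb above
```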

 

When To Use Each Test

This block summarizes when you would use any given test.  As we go down this list we know less and less information and have to rely more on what we measure.  I.e. instead of looking up the average age of a region from the government census (i.e. knowing it), we go ask 100 people what their age is (i.e. measuring it).

 

Z Test

  • You know baseline average
  • You measure sample average
  • You know baseline standard deviation
  • You assume sample standard deviation is the same as the baseline standard deviation
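
A minimal sketch of the Z-test itself (my own illustration with made-up numbers):

```python
import math
from scipy.stats import norm

baseline_mean = 65.0   # known baseline average (e.g. from a census)
baseline_std = 5.0     # known baseline standard deviation
sample_mean = 67.0     # measured sample average
n = 25                 # number of measurements in the sample

# Z statistic: how many standard errors the sample average is from the baseline
z = (sample_mean - baseline_mean) / (baseline_std / math.sqrt(n))
p_value = 2 * norm.sf(abs(z))   # two-sided p-value from the normal curve
print(f"z = {z:.2f}, p = {p_value:.4f}")   # z = 2.00, p ~ 0.0455
```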

1 Sample t-test

  • You know baseline average
  • You measure sample average
  • You don’t care about baseline standard deviation  (because we have so many baseline samples that it doesn’t matter)
  • You measure sample standard deviation

2 Sample t-test – equal variance

  • You measure baseline average
  • You measure sample average
  • You measure both the sample standard deviation and the baseline standard deviation, but assume that they are the same as each other, so you pool the measurements together and do the calculations as a group

2 Sample t-test – unequal variance

  • You measure baseline average
  • You measure sample average
  • You measure baseline standard deviation
  • You separately measure sample standard deviation

The last hypothesis test is slightly different, because the previous ones all assumed that you were measuring different groups.  The paired t-test assumes that the data points in the two sets are tied together, i.e. each pair of data points measures the same person before and after.

Paired T-test

  • The average value is the average of the difference between the before and after data
  • The standard deviation value is the standard deviation of the differences between the before and after values
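
A sketch of the paired test (hypothetical before/after numbers, assuming scipy), which also shows that it is equivalent to a 1-sample t-test on the per-person differences:

```python
from scipy.stats import ttest_1samp, ttest_rel

# Hypothetical measurements of the same five people before and after
before = [100, 95, 110, 105, 98]
after = [104, 99, 112, 111, 101]

t_stat, p_value = ttest_rel(after, before)
print(f"paired t-test:           t = {t_stat:.2f}, p = {p_value:.4f}")

# Identical to a 1-sample t-test of the differences against 0
diffs = [a - b for a, b in zip(after, before)]
t_stat2, p_value2 = ttest_1samp(diffs, 0)
print(f"1-sample on differences: t = {t_stat2:.2f}, p = {p_value2:.4f}")
```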

 

Truthfully, in many cases, you aren’t going to get very much difference no matter which of these equations you use.   Some of the equations pretty much reduce to the other equations as you get more and more information, i.e. if you measure at least 30-50 data points the difference between the 1 sample T-test and the Z test is pretty small.

 

 

Level 3 – Morphing Equations Into Each Other

It turns out that many of the equations for the 5 different hypothesis tests are just simplifications of each other.  They can be morphed into each other as you make assumptions, such as that the number of measurements goes to infinity for one of the datasets.

  • Blog post on this to be done.

How To Learn Statistics

This section of the website covers statistics and has most of the same topics that would be covered in a statistics 101 course at a University.  Here is the table of contents for the different statistics topics covered on this site.

Two good examples of free University content for this material are

Another great resource for statistics help is Cross Validated.  There you can ask questions about your specific problems and get help.

 

Purpose Of This Page

The reason I wrote this material came from my own pursuit of the same information.  I am about 10 years removed from studying aerospace engineering in college and have been working as a professional engineer since then.  However, even though I studied statistics at the university, a lot of the concepts proved to be slippery, “use them or forget them” material.  This is especially true of things I knew mostly by memorizing equations.

Remembering equations worked fine for carrying me through a semester of school, but to remember the topic for a whole career or a lifetime, it turns out that I needed something different.  I needed to find ways to relate each topic to other things that I knew, so it wasn’t on the periphery of my knowledge anymore, but deeply ingrained.

 

Benefits & Perils Of Self-Teaching

These topics are my attempt to cover statistics in a way that is easy to learn and remember long term.  A lot of this material I re-learned through self-study.  Being largely self-taught in these topics often makes it easier for me to find relatable examples for other learners than it would be for the complete expert.

The downside of self-study is that it can leave gaps.  For instance, when learning how to do multiple linear regression (i.e. draw a straight line/plane/hyperplane through data that has at least 2 independent variables) I ran across a method of doing it that was an intuitive but laborious expansion of simple linear regression.  A reader who is more of an expert than I am pointed out that I had missed a simple-to-do (but difficult to understand) method of doing multiple regression (i.e. the Moore-Penrose pseudoinverse).

If you spot similar oversights or just plain errors, please let me know (you can find my email address here), and I can include that information for future learners.

 

How To Learn Statistics

I think that statistics is an area where you want to go breadth-first.  I.e. you would rather know something about a lot of topics than know a lot about a few topics.   My friend Kalid over at Better Explained refers to it as the ADEPT method.   A good analogy is going for the big picture first, even if it is not completely clear

You see the whole thing, but it is fuzzy

More study gradually brings it into focus

more study brings it into focus

Rather than diving deeply into specific topics before moving on.

From a practical sense, what that can mean is skimming topics on an initial read, and then revisiting the topics a couple of times as you learn other material and see the connections between them.  To facilitate this, I have set up each page with a couple of different sections.  The first section of each page is designed for an initial read-through to get the big picture, and the second or third sections are designed to draw connections to the other topics and to dive into a deeper understanding.  I recommend reading the first section of any given topic and then coming back a week or so later, after you’ve had time to internalize some of the material, do some example problems, and look at some other topics, before diving into the second or third layer of the material

 

Sources Of Material

The internet has a lot of great content on it, and it doesn’t make sense to duplicate a bunch of material that already exists.  So when I know of pages that have great explanations or find useful tools, I will link to them.  And if you have suggestions please let me know.    I have also covered some of the other topics in more depth as Kindle Books (typically priced under 3 dollars), so I will link to that material where it makes sense.  I know that some people can’t access the Kindle content or are not in a position to purchase it, so I would be happy to send a free PDF copy if you contact me.

 

Enough Preamble

That’s all I have by way of introduction.  To get started go to