Distributions
Now that we’ve seen the central point of a distribution, and then the deviation of data from the central point, let’s explore how these distributions look like.
Before we explore the distributions, its necessary to understand a few key terminologies
Skewness
Skewness is the lack of symmetry of a distribution
- Positive skewed = data is heavy towards the left ; tail towards the right
- Negative skewed = data is heavier towards the right ; tail towards the left
Kurtosis
Kurtosis simply refers to the pointiness of the curve. The spikier the curve, the more the kurtosis value is.
- Positive kurtosis = leptokurtic
- Negative kurtosis = platykurtic
Normal / Gaussian / Continuous Distribution
Most commonly occurring distribution of data where the data is distributed symmetrically around the center. The bell curve is a classic example of normally distributed data.
In a normal distribution:
- Mean = Median = Mode.
skewness = kurtosis = 0
Normal Distribution Data
Percentage | μ ± s.d. |
---|---|
68.3 % | μ ± 1σ |
95.5 % | μ ± 2σ |
99.7 % | μ ± 3σ |
Binomial Distribution
A binomial distribution is a discrete distribution used in statistics. It only counts 2 states typically 0 and 1.
Uniform Distribution
The uniform distribution consists of similar values throughout.
Skewed Distribution
Data distributed which spikes towards either ends as opposed to the central spike in normal distribution (skewness: lack of symmetry)
Data dist type | Measure of Central Tendency | Measure of spread (variation) |
---|---|---|
Normal | Mean | Standard Deviation |
Non normal (skewed) | Median | Range, Percentile & IQR |
To convert any dataset with any mean & std deviation to a dataset with mean = 0 & std dev = 1.
Can be done using z-scores: z = (x - x̄) ÷ s
Central Limit Theorem
The distribution of the sample means tends towards a normal distribution as the number of samples increase.
Below is a step by step explanation of what the Central Limit Theorem does.
- Take a random population and plot its distribution (assuming distribution ≠ normal distribution)
- Samples of constant size n are to be taken from the population (n1 = n2 = n3 …)
- Take a random sample of size n1 from the population and calculate its mean x̄1
- Take a random sample of size n2 from the population and calculate its mean x̄2
- Plot the means x̄1, x̄2, x̄3 …
- The sampling distribution (plot of x̄1, x̄2, x̄3 …) tends to be a normal distribution as the number of samples increases
- The variance of the sampling distribution = The variance of the population / sample size σx̄2 = σ2 / n
- With a higher value of n, the sampling distribution tends to have a lower variance (high kurtosis), and vice-versa
- The standard deviation of the sampling means is called Standard Error of the Mean.
How Do We Transform Skewed Data?
Since you know how much the skewed data can affect our machine learning model’s predicting capabilities, it is better to transform the skewed data to normally distributed data. Here are some of the ways you can transform your skewed data:
- Power Transformation
- Log Transformation
- Exponential Transformation