The Python Pandas library provides a function to calculate the standard deviation of a data set. Let’s find out how.
The Pandas DataFrame std() function lets you calculate the standard deviation of a data set. The standard deviation is usually calculated for a given column and is normalised by N-1 by default. The degrees of freedom of the standard deviation can be changed using the ddof parameter.
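A minimal sketch of the two normalisations (the column name and values here are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0]})

# Default: ddof=1, the sample standard deviation (normalised by N-1)
sample_std = df["price"].std()

# ddof=0 gives the population standard deviation (normalised by N)
population_std = df["price"].std(ddof=0)

print(sample_std, population_std)
```

Because it divides by the smaller N-1, the default sample standard deviation is always slightly larger than the population version for the same data.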
In this article I will first make clear why we use the standard deviation, and then we will look at how to use Pandas to calculate it for your data.
Let’s get started!
Standard Deviation and Mean Relationship
I have read many articles that explain the standard deviation with Pandas simply by showing how to calculate it and which parameters to pass.
But, the most important thing was missing…
An actual explanation of what calculating the standard deviation of a set of data means (e.g. for a column in a dataframe).
The standard deviation tells you how much a set of data deviates from its mean: it is a measure of how spread out the data is. The more spread out the data, the higher the standard deviation.
With a low standard deviation, most of the data is distributed close to the mean. With a high standard deviation, the data is distributed over a wider range of values.
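To make this concrete, here are two small made-up data sets with the same mean but very different spread:

```python
import pandas as pd

# Both series have a mean of 50, but very different spread
tight = pd.Series([49, 50, 50, 51])   # values clustered around the mean
spread = pd.Series([10, 30, 70, 90])  # values far from the mean

print(tight.mean(), spread.mean())  # both 50.0
print(tight.std())                  # low standard deviation
print(spread.std())                 # high standard deviation
```

The mean alone cannot distinguish these two data sets; the standard deviation is what captures the difference in spread.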
Why do we use standard deviation?
To understand whether a specific data point is in line with the rest of the data points (it's expected) or whether it's unexpected compared to the rest of the data.
Pandas Standard Deviation of a DataFrame
Let’s create a Pandas DataFrame that contains historical data for Amazon stock over a 3-month period. The data comes from Yahoo Finance and is in CSV format.
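A sketch of loading such a file with pd.read_csv. The rows below are illustrative stand-ins for the real export (which would cover about three months of data); the column names match the usual Yahoo Finance CSV layout.

```python
import io

import pandas as pd

# Illustrative stand-in for a Yahoo Finance CSV export; in practice you
# would pass the path of the downloaded file to pd.read_csv instead.
csv_data = """Date,Open,High,Low,Close,Adj Close,Volume
2021-01-04,163.50,163.60,157.20,159.33,159.33,88228000
2021-01-05,158.30,161.17,158.25,160.93,160.93,53110000
2021-01-06,157.32,159.88,156.56,156.92,156.92,87896000
"""

df = pd.read_csv(io.StringIO(csv_data), parse_dates=["Date"])
print(df["Close"].std())  # sample standard deviation of the Close column
```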