Statistics with Python
Mathematics permeates every aspect of human existence. Mathematics and statistics show up in everything around us, including patterns, colors, and even the number of petals on a flower. Because math and statistics are the foundation of all machine learning algorithms, data science demands a good background in these subjects.
Working with data requires the ability to describe, summarize, and represent it both numerically and visually. Python's statistics libraries are comprehensive, practical, and widely used tools for this work.
Statistics is the branch of applied mathematics that deals with the gathering, analysis, interpretation, and presentation of data. Through statistics, we can understand how data can be applied to tackle challenging problems.
Python statistics libraries to choose from:
There are many Python statistics libraries available, but in this tutorial, you’ll learn about some of the most popular and widely used ones:
statistics is Python's built-in library for descriptive statistics. You can use it if your datasets are not too large or if you cannot rely on importing other libraries.
NumPy is a third-party library for numerical computing, designed for working with single- and multidimensional arrays. Its primary type is the array type called ndarray. The library contains many routines for statistical analysis.
SciPy is a third-party scientific computing library built on NumPy. Compared to NumPy, it provides additional capabilities, such as the statistical analysis module scipy.stats.
Pandas is a third-party library for numerical computing built on NumPy. It excels at managing labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.
Matplotlib is a third-party data visualization library. It works well together with NumPy, SciPy, and Pandas.
Note that Series and DataFrame objects can often be used in place of NumPy arrays. Most of the time, you can simply pass them to a statistical function in NumPy or SciPy. Additionally, calling .values or .to_numpy() returns the unlabeled data from a Series or DataFrame as an np.ndarray object.
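As a minimal sketch of this interchangeability, the snippet below passes a Series straight to a NumPy function and then extracts the unlabeled array. The data and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical labeled data: three measurements with string labels.
series = pd.Series([2.0, 4.0, 9.0], index=["a", "b", "c"])

# A Series can be passed directly to a NumPy statistical function.
mean_from_series = np.mean(series)

# .to_numpy() drops the labels and returns a plain np.ndarray.
raw = series.to_numpy()

print(mean_from_series)    # 5.0
print(type(raw).__name__)  # ndarray
```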
Descriptive Statistics Calculation
In simple terms, descriptive statistics refers to the process of describing and summarizing data using representational tools like charts, tables, and Excel files. The data is presented in a way that conveys meaningful information and can also help identify potential patterns.
Univariate analysis is the process of describing and summarizing a single variable. Bivariate analysis describes the statistical association between two variables. Multivariate analysis describes the statistical association between several variables.
Import all the necessary packages first:
>>> import math
>>> import statistics
>>> import numpy as np
>>> import scipy.stats
>>> import pandas as pd
You won’t need any other packages for Python statistics calculations. You won’t typically use Python’s built-in math library for statistics, but it will be helpful in this tutorial. matplotlib.pyplot will be imported later for data visualization.
Statistics Library for Python
Statistics is the mathematical science that covers data collection, analysis, interpretation, and presentation. Because statistics is used to solve complex real-world problems, data scientists and analysts rely on it to find meaningful patterns and changes in data. Simply put, statistics applies mathematical measures to data to derive insightful conclusions.
1. Central Tendency Measures: These measures describe the middle or central values of a dataset. There are several definitions of what counts as the center of a dataset. This section shows how to recognize and compute various measures of central tendency.
a. Mean: The mean is calculated by dividing the sum of the observations by the number of observations. This sum divided by the count is also called the average. eg:
from statistics import mean
numbers = [1, 2, 3, 4, 5]
result = mean(numbers)
print(result)
Output:
3
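The "sum divided by the count" definition can be checked directly. The sketch below uses a different, made-up list and compares the manual calculation with mean() and with fmean(), which always returns a float and is available from Python 3.8.

```python
from statistics import fmean, mean

values = [1, 2, 3, 4]  # illustrative data

# Sum of the observations divided by the number of observations.
manual_mean = sum(values) / len(values)

print(manual_mean)    # 2.5
print(mean(values))   # 2.5
print(fmean(values))  # 2.5
```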
b. Median: The median is the middle value of the sorted data set. It divides the data into two halves. If the number of elements is odd, the median is the center element. If the number of elements is even, the median is the average of the two central elements. eg:
from statistics import median
numbers = [1, 2, 3, 4, 5]
result = median(numbers)
print(result)
Output:
3
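To make the odd and even cases concrete, here is a small sketch with two made-up lists. With an even number of elements, median() averages the two central values of the sorted data.

```python
from statistics import median

odd_data = [7, 1, 3]      # sorted: 1 3 7
even_data = [7, 1, 3, 5]  # sorted: 1 3 5 7

odd_median = median(odd_data)    # the center element
even_median = median(even_data)  # average of 3 and 5

print(odd_median)   # 3
print(even_median)  # 4.0
```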
c. Median_low: When the number of elements is even, the median_low() function returns the lower of the two middle elements; otherwise, it returns the median of the data. A StatisticsError exception is raised if the given data is empty. eg:
from statistics import median_low
numbers = [5, 1, 3, 2, 4, 6]
result = median_low(numbers)
print(result)
Output:
3
d. Median_high: When the number of elements is even, the median_high() function returns the higher of the two central elements; otherwise, it returns the median of the data. A StatisticsError exception is raised if the given data is empty. eg:
from statistics import median_high
numbers = [5, 1, 3, 2, 4]
result = median_high(numbers)
print(result)
Output:
3
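The difference between the two functions only appears when the number of elements is even. The sketch below runs both on the same six-element list and also shows the StatisticsError raised for empty data. The variable names are chosen for illustration.

```python
from statistics import StatisticsError, median_high, median_low

numbers = [5, 1, 3, 2, 4, 6]  # sorted: 1 2 3 4 5 6

low = median_low(numbers)    # lower of the two middle elements
high = median_high(numbers)  # higher of the two middle elements
print(low, high)             # 3 4

# Empty data has no median, so a StatisticsError is raised.
raised = False
try:
    median_low([])
except StatisticsError:
    raised = True
print(raised)  # True
```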
e. Median_grouped: The median_grouped() function uses interpolation to calculate the median of grouped continuous data, that is, the 50th percentile. eg:
from statistics import median_grouped
data = [1, 1, 1, 2, 3, 3, 3, 4, 5, 5, 5]
result = median_grouped(data)
print(result)
Output:
3.0
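The interpolation can be sketched by hand. The snippet below assumes the usual grouped-median formula, L + interval * (n/2 - cf) / f, where L is the lower limit of the median class, cf is the number of values below that class, and f is the number of values in it. The data and variable names are illustrative, and the result is compared with median_grouped() rather than taken for granted.

```python
import math
from statistics import median_grouped

data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]  # illustrative data
interval = 1                           # assumed class width

ordered = sorted(data)
n = len(ordered)
middle = ordered[n // 2]             # value whose class contains the median
lower_limit = middle - interval / 2  # L: lower boundary of that class
below = sum(1 for v in ordered if v < middle)  # cf: values below the class
in_class = ordered.count(middle)               # f: values inside the class

manual = lower_limit + interval * (n / 2 - below) / in_class
library = median_grouped(data, interval)

print(math.isclose(manual, library))  # True
```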
f. Mode: The mode is the value that occurs most frequently in the data set. If every data point is equally frequent, the data set might not have a mode. If two or more data points share the highest frequency, the data set has more than one mode. eg:
from statistics import mode
numbers = [1, 2, 3, 3, 4, 4, 4, 5]
result = mode(numbers)
print(result)
Output:
4
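When several values tie for the highest frequency, multimode() (available from Python 3.8) returns all of them in the order they first appear. A short sketch with made-up lists:

```python
from statistics import mode, multimode

single = mode([1, 2, 3, 3, 4, 4, 4, 5])
tied = multimode([1, 1, 2, 2, 3])   # 1 and 2 both occur twice
all_equal = multimode([1, 2, 3])    # every value occurs once

print(single)     # 4
print(tied)       # [1, 2]
print(all_equal)  # [1, 2, 3]
```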
2. Variability Measures: Measures of central tendency alone cannot adequately describe data. You also need measures of variability that quantify the spread of the data values. In this section, you’ll learn how to recognize and compute the following variability measures:
a. Standard Deviation: The standard deviation is the square root of the variance. To compute it, find the mean (the average), subtract it from each value, and square each difference. Then add the squared differences together, divide the total by the number of values minus one (stdev() computes the sample standard deviation), and take the square root of the result. eg:
from statistics import stdev
numbers = [1, 2, 3, 4, 5]
result = stdev(numbers)
print(result)
Output:
1.5811388300841898
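The relationship to the variance, and the difference between the sample and population versions, can be checked with a short sketch. stdev() divides by n - 1, while pstdev() divides by n.

```python
import math
from statistics import pstdev, stdev, variance

numbers = [1, 2, 3, 4, 5]

sample_sd = stdev(numbers)       # sample standard deviation (n - 1)
population_sd = pstdev(numbers)  # population standard deviation (n)

# The standard deviation is the square root of the variance.
matches_variance = math.isclose(sample_sd, math.sqrt(variance(numbers)))

print(sample_sd)         # 1.5811388300841898
print(population_sd)     # 1.4142135623730951
print(matches_variance)  # True
```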
b. Variance: The variance is the average squared deviation from the mean. It is computed by taking the difference between each data point and the mean, squaring each difference, adding the squared differences together, and then dividing by the number of data points minus one for a sample, which is what statistics.variance() does. To divide by the number of data points instead, as for a whole population, use statistics.pvariance(). eg:
import statistics
sample = [1, 2, 3, 4, 5]
variance = statistics.variance(sample)
print(variance)
Output:
2.5
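The steps described above can be sketched by hand and compared with the library functions. The variable names are illustrative.

```python
from statistics import mean, pvariance, variance

sample = [1, 2, 3, 4, 5]
n = len(sample)
center = mean(sample)

# Squared difference between each data point and the mean.
squared_deviations = [(x - center) ** 2 for x in sample]

# Sample variance divides by n - 1; population variance divides by n.
manual_sample_variance = sum(squared_deviations) / (n - 1)
manual_population_variance = sum(squared_deviations) / n

print(manual_sample_variance)                           # 2.5
print(manual_sample_variance == variance(sample))       # True
print(manual_population_variance == pvariance(sample))  # True
```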
c. Skewness: The skewness of a data sample is a measure of how asymmetrical it is. Skewness has several mathematical definitions. One common expression is $\sum_{i}(x_i - \mathrm{mean}(x))^3 \, n / ((n - 1)(n - 2)s^3)$, where $i = 1, 2, \ldots, n$, mean(x) is the sample mean of x, and s is the sample standard deviation. Skewness defined in this way is called the adjusted Fisher-Pearson standardized moment coefficient.
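The skewness expression translates almost directly into code. The following is a minimal sketch using only the built-in statistics module; the function name sample_skewness and the datasets are hypothetical. If SciPy is installed, scipy.stats.skew(x, bias=False) should give matching values.

```python
from statistics import mean, stdev


def sample_skewness(x):
    """Adjusted Fisher-Pearson skewness, following the expression above."""
    n = len(x)
    if n < 3:
        raise ValueError("skewness needs at least three data points")
    m = mean(x)
    s = stdev(x)  # sample standard deviation (divides by n - 1)
    cubed = sum((xi - m) ** 3 for xi in x)
    return cubed * n / ((n - 1) * (n - 2) * s ** 3)


symmetric = sample_skewness([1, 2, 3, 4, 5])      # no asymmetry
right_skewed = sample_skewness([1, 2, 3, 4, 10])  # long right tail

print(symmetric)         # 0.0
print(right_skewed > 0)  # True
```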
d. Percentiles: The sample p percentile is the element of the dataset such that p% of the dataset’s elements are less than or equal to it. Additionally, (100 − p)% of the elements are greater than or equal to that value. If the dataset contains two such elements, then the sample p percentile is their arithmetic mean. Each dataset has three quartiles, which are the percentiles that divide the dataset into four parts.
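As a small sketch, the three quartiles can be computed with statistics.quantiles() (Python 3.8+). The data is made up, and method="inclusive" is a choice made here: it treats the data as including the minimum and maximum of the population, which for this data should agree with np.percentile(data, [25, 50, 75]) under NumPy's default interpolation.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative data

# n=4 splits the data into four parts, giving the three quartiles.
quartiles = quantiles(data, n=4, method="inclusive")

print(quartiles)  # [3.0, 5.0, 7.0]
```

The second quartile is the median, so quartiles[1] equals statistics.median(data).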
e. Ranges: The range of data is the difference between the maximum and minimum elements in the dataset. You can get it with the np.ptp() function. If your NumPy array contains nan values, this function returns nan. Passing a Pandas Series object may instead return a number, depending on the Pandas version.
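A short sketch of the range calculation follows, with made-up values. The nan-aware alternative built from np.nanmax() and np.nanmin() is a suggestion added here, not something the text above prescribes.

```python
import numpy as np

values = np.array([8.0, 1.0, 2.5, 4.0, 28.0])

data_range = np.ptp(values)                # peak to peak: max - min
same_range = values.max() - values.min()

with_nan = np.array([8.0, 1.0, np.nan, 4.0])
nan_range = np.ptp(with_nan)               # nan propagates
nan_aware_range = np.nanmax(with_nan) - np.nanmin(with_nan)

print(data_range)       # 27.0
print(same_range)       # 27.0
print(nan_range)        # nan
print(nan_aware_range)  # 7.0
```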
Data visualization:
In addition to numerical measures like the mean, median, and variance, you can use visual methods to present, describe, and summarize data. This section covers the following graphs:
a. Box plots: A box plot is a graphical illustration of a five-number summary of the data, which includes one of the measures of central tendency, the median. It does not show the distribution in as much detail as a stem and leaf plot or a histogram. Its main uses are to indicate whether a distribution is skewed and whether the data contains any outliers (potentially unusual observations).
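The sketch below computes a five-number summary with NumPy and draws the matching box plot with Matplotlib. The dataset is randomly generated with two planted outliers, so all values are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 200 normal samples plus two planted outliers.
rng = np.random.default_rng(seed=0)
data = np.concatenate([rng.normal(0.0, 1.0, 200), [6.0, 7.5]])

# Five-number summary: minimum, Q1, median, Q3, maximum.
summary = np.percentile(data, [0, 25, 50, 75, 100])

fig, ax = plt.subplots()
parts = ax.boxplot(data)
ax.set_ylabel("value")

# Points drawn beyond the whiskers are the suspected outliers.
outliers = parts["fliers"][0].get_ydata()
```

Call plt.show() to display the figure, or fig.savefig() with a file name to write it to disk.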
b. Histograms: A histogram is a bar-graph-like visualization that groups values into classes, or bins, shown as columns along the horizontal x-axis. The vertical y-axis shows the count or percentage of occurrences in each column. The columns make patterns in the data distribution visible.
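As a sketch, np.histogram() returns the count in each bin together with the bin edges, and Axes.hist() draws the same bins. The data is randomly generated for illustration, and the choice of 10 bins is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)  # hypothetical data

# Counts per bin and the 11 edges that delimit the 10 bins.
counts, edges = np.histogram(data, bins=10)

fig, ax = plt.subplots()
ax.hist(data, bins=edges, edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("frequency")
```

Call plt.show() to display the figure.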
c. Pie charts: Pie charts are used to depict data with a small number of labels and given relative frequencies. They are effective even with labels that cannot be ordered (like nominal data). A pie chart is a circle divided into slices. Each slice represents a single unique label from the dataset, and its area is proportional to the relative frequency of that label.
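A minimal sketch with three made-up labels and counts follows. The relative frequencies are computed explicitly to show what the slice areas represent.

```python
import matplotlib.pyplot as plt

labels = ["x", "y", "z"]  # hypothetical nominal labels
sizes = [15, 25, 60]      # hypothetical counts

# Each slice's share of the circle equals its relative frequency.
relative = [size / sum(sizes) for size in sizes]

fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct="%1.1f%%")
```

Call plt.show() to display the figure.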
d. Bar charts: Bar charts can also be used to display data that has discrete numerical values or labels associated with it. They show the data pairs from two datasets: the labels belong to one set, and their matching frequencies belong to the other. You can also choose to display the errors associated with the frequencies.
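The sketch below pairs a set of labels with their frequencies and uses the yerr parameter of Axes.bar() to draw error bars. The labels, frequencies, and errors are invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

labels = ["a", "b", "c"]              # one dataset: the labels
frequencies = np.array([12, 30, 21])  # the other: matching frequencies
errors = np.array([2, 4, 3])          # hypothetical error for each frequency

fig, ax = plt.subplots()
bars = ax.bar(labels, frequencies, yerr=errors, capsize=4)
ax.set_xlabel("label")
ax.set_ylabel("frequency")

heights = [bar.get_height() for bar in bars]
```

Call plt.show() to display the figure.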
e. X-Y plots: An x-y plot, or scatter plot, shows the data pairs from two datasets. The horizontal x-axis displays the values from the set x, while the vertical y-axis displays the matching values from the set y. You can optionally include the regression line and the correlation coefficient. scipy.stats.linregress() can perform the linear regression on two such datasets.
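The following sketch generates two made-up datasets, fits a line with scipy.stats.linregress(), and plots the points together with the regression line. The true slope of 2 and intercept of 5 are assumptions built into the synthetic data.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

# Hypothetical data: y depends linearly on x, plus random noise.
rng = np.random.default_rng(seed=0)
x = np.arange(20)
y = 2.0 * x + 5.0 + rng.normal(0.0, 3.0, size=x.size)

result = scipy.stats.linregress(x, y)
line = result.slope * x + result.intercept

fig, ax = plt.subplots()
ax.plot(x, y, linewidth=0, marker="o", label="data points")
ax.plot(x, line, label=f"regression line, r={result.rvalue:.2f}")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Here result.rvalue is the correlation coefficient shown in the legend. Call plt.show() to display the figure.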
f. Heatmaps: A heatmap is a graphical representation of a matrix. The colors represent the matrix’s numbers or elements. Heatmaps are particularly useful for displaying covariance and correlation matrices. You can generate a heatmap for a covariance matrix with Matplotlib's imshow().
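As a sketch, the covariance matrix of two made-up variables can be computed with np.cov() and drawn with imshow(). The data, tick labels, and text annotations are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: y is correlated with x.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)

matrix = np.cov(x, y)  # 2x2 covariance matrix

fig, ax = plt.subplots()
image = ax.imshow(matrix)
ax.set_xticks([0, 1])
ax.set_xticklabels(["x", "y"])
ax.set_yticks([0, 1])
ax.set_yticklabels(["x", "y"])

# Write each matrix element on top of its colored cell.
for i in range(2):
    for j in range(2):
        ax.text(j, i, f"{matrix[i, j]:.2f}", ha="center", va="center", color="w")

fig.colorbar(image, ax=ax)
```

Replacing np.cov(x, y) with np.corrcoef(x, y) gives the corresponding correlation matrix. Call plt.show() to display the figure.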
Conclusion:
Python is frequently used to create websites and applications, automate tasks, analyze data, and visualize data. Many non-programmers, including accountants and scientists, have adopted Python because it’s reasonably simple to learn and can be used for several common activities like managing finances.
Python has a sizable and vibrant developer community that adds to the language’s collection of modules and libraries and serves as a valuable resource for other programmers. The extensive support network makes it very simple for coders to discover a solution when they run into an issue because someone else has almost certainly already faced it.
