AI For Trading: Regression (15)


Checking and transforming data



Note on the chart used to describe signal-to-noise: the horizontal x-axis is time (for example, days) and the vertical y-axis is price (for example, dollars). The blue line represents the combined signal plus noise, which is the stock price movement we actually observe. The red dashed line represents the signal without noise, which is not directly observable.
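To make the note above concrete, here is a minimal sketch that builds an observed series as signal plus noise. The linear trend, the noise scale, and all variable names are assumptions chosen purely for illustration; they are not taken from the course chart.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

days = np.arange(250)                          # x-axis: time in days
signal = 100.0 + 0.05 * days                   # hypothetical underlying signal, in dollars
noise = rng.normal(0.0, 2.0, size=days.size)   # hypothetical zero-mean noise
observed = signal + noise                      # the series we would actually see

# One rough way to express signal-to-noise: ratio of the two variances
snr = signal.var() / noise.var()
print(f"signal-to-noise (variance ratio): {snr:.2f}")
```

In practice only `observed` is available; `signal` and `noise` are separated here only because the data is simulated.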



Many statistical models assume that the data follow a normal distribution, also referred to as a Gaussian distribution or a bell curve.

This assumption matters when we check whether our models are valid, because many of the tests we use to confirm that a model describes a meaningful relationship rely on it.
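As a rough illustration of checking the normality assumption, the sketch below computes sample skewness and excess kurtosis, both of which are close to zero for normally distributed data. This is an informal diagnostic, not a formal statistical test, and not necessarily the check the course uses. The helper name `skew_and_excess_kurtosis` and the simulated samples are hypothetical.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()          # standardize to mean 0, std 1
    return float(np.mean(z**3)), float(np.mean(z**4) - 3.0)

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=10_000)                         # simulated, roughly normal
lognormal_sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # simulated, right-skewed

for name, sample in [("normal", normal_sample), ("lognormal", lognormal_sample)]:
    skew, kurt = skew_and_excess_kurtosis(sample)
    print(f"{name}: skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```

Values far from zero, as for the log-normal sample, suggest that the normality assumption may not hold for the raw data.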

Exercise: Visualize Distributions

Many variables tend to follow a Normal distribution (hence the name “Normal”), both in nature and in artificial contexts. But there are other distributions as well: some are variants of the Normal distribution, and some are completely different. Each distribution is suitable for modeling certain kinds of variables.
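To see how different distributions produce differently shaped histograms, the following sketch draws simulated samples with NumPy and bins them with `np.histogram`. The choice of distributions and their parameters is an assumption made for illustration; it does not reveal which distributions the exercise samples come from.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical samples from four common distributions
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=n),
    "lognormal": rng.lognormal(mean=0.0, sigma=0.5, size=n),
    "uniform": rng.uniform(low=-1.0, high=1.0, size=n),
    "exponential": rng.exponential(scale=1.0, size=n),
}

for name, values in samples.items():
    counts, edges = np.histogram(values, bins=16)   # 16 bins give 17 edges
    centers = (edges[:-1] + edges[1:]) / 2          # midpoint of each bin
    peak = centers[np.argmax(counts)]
    print(f"{name:12s} most populated bin is centred near {peak:.2f}")
```

The bin counts and centers computed this way can also be passed to a bar plot if you prefer to draw the histogram manually.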

In this exercise, you are given some samples of data. Plot the histogram of each sample, and then try to match it with the corresponding distribution.

"""Visualize the distribution of different samples."""

import pandas as pd
import matplotlib.pyplot as plt

def plot_histogram(sample, title, bins=16, **kwargs):
    """Plot the histogram of a given sample of random values.

    sample : pandas.Series
        raw values to build histogram
    title : str
        plot title/header
    bins : int
        number of bins in the histogram
    kwargs : dict
        any other keyword arguments for plotting (optional)
    """
    # Plot histogram (no need to return anything)
    plt.hist(sample, bins=bins, **kwargs)
    plt.title(title)
    plt.show()

def test_run():
    """Test run plot_histogram() with different samples."""
    # Load and plot histograms of each sample
    # Note: Try plotting them one by one if it's taking too long
    # The `squeeze=True` keyword was removed from read_csv in pandas 2.0,
    # so squeeze the single-column DataFrame into a Series explicitly.
    A = pd.read_csv("A.csv", header=None).squeeze("columns")
    plot_histogram(A, title="Sample A")

    B = pd.read_csv("B.csv", header=None).squeeze("columns")
    plot_histogram(B, title="Sample B")

    C = pd.read_csv("C.csv", header=None).squeeze("columns")
    plot_histogram(C, title="Sample C")

    D = pd.read_csv("D.csv", header=None).squeeze("columns")
    plot_histogram(D, title="Sample D")

if __name__ == '__main__':