AI For Trading: Data Processing (5)
Tick data is sometimes called heterogeneous data because it is not sampled at regular time intervals, whereas minute-level or end-of-day data is called homogeneous. Converting heterogeneous data to a homogeneous form can be a good exercise!
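As a rough illustration of that exercise, here is a minimal sketch that turns irregular ticks into regular 1-minute bars with pandas. The tick values, timestamps, and column names are made up for the example; real tick feeds will look different.

```python
import pandas as pd

# Hypothetical tick data: one row per trade, at irregular timestamps.
ticks = pd.DataFrame(
    {
        "price": [10.00, 10.02, 10.01, 10.05, 10.04],
        "volume": [100, 50, 200, 75, 25],
    },
    index=pd.to_datetime(
        [
            "2024-01-02 09:30:01",
            "2024-01-02 09:30:40",
            "2024-01-02 09:31:15",
            "2024-01-02 09:33:05",
            "2024-01-02 09:33:59",
        ]
    ),
)

# Resample onto a regular 1-minute grid: open/high/low/close for price,
# and the total traded volume within each minute.
bars = ticks["price"].resample("1min").ohlc()
bars["volume"] = ticks["volume"].resample("1min").sum()

print(bars)
```

Note that the 09:32 minute has no trades, so its prices come out as NaN and its volume as 0. Whether to forward-fill such empty bars or leave them missing is a design choice that depends on the analysis.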
When to use timestamps
Assume that you are using minute-level stock data that includes a timestamp for each row, indicating the beginning of that minute. Let’s say the data spans a single month. In which of the following scenarios would you use these timestamps (check all that apply)?
Aggregating the volume of trades per day
Adjusting for gaps due to market closing and opening
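To make the two scenarios concrete, here is a small sketch of how the timestamps could be used in each. The minute bars below are invented, and treating any gap longer than one minute as a session break is a simplifying assumption.

```python
import pandas as pd

# Hypothetical minute bars; each timestamp marks the start of its minute.
idx = pd.to_datetime(
    [
        "2024-01-02 15:58",
        "2024-01-02 15:59",
        "2024-01-03 09:30",
        "2024-01-03 09:31",
    ]
)
minute_bars = pd.DataFrame(
    {"close": [10.0, 10.1, 10.4, 10.3], "volume": [500, 700, 900, 300]},
    index=idx,
)

# Scenario 1: aggregate volume per day by grouping rows on the calendar date.
daily_volume = minute_bars.groupby(minute_bars.index.date)["volume"].sum()

# Scenario 2: find gaps between consecutive rows. A gap longer than one
# minute (here, the overnight close) is flagged as a session break.
gaps = minute_bars.index.to_series().diff()
session_breaks = gaps > pd.Timedelta(minutes=1)

print(daily_volume)
print(session_breaks)
```

Without the timestamps, neither the day boundaries nor the overnight gap could be recovered from the rows alone.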
Corporate Action: Stock Splits
Although a stock split shouldn’t theoretically affect the market cap of a stock, in reality it does! There are some intriguing behavioral patterns that researchers have observed among traders.
One such pattern: after a stock splits and the share price drops considerably, some traders seem to expect it to climb back to the previous price (double or triple the new one)!
This creates an artificial demand for the stock, which in turn actually pushes up the price.
Moving-window or “rolling” statistics are typically calculated with respect to a past period.
Therefore, you won’t have a valid value at the beginning of the resulting time series until one complete window of data is available.
For instance, when you compute the Simple Moving Average with a 1-month or 30-day window, the result is undefined for the first 29 days.
This is okay, and smart data analysis libraries like pandas will mark these with a special “NA” or “NaN” value (not zero, because that would be indistinguishable from an actual zero value!). Subsequent processing or plotting will interpret those as missing data points.
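A minimal sketch of this behaviour, using a made-up series of 40 daily prices and a 30-day window:

```python
import pandas as pd

# Hypothetical daily prices for 40 days: 1.0, 2.0, ..., 40.0
prices = pd.Series(range(1, 41), dtype="float64")

# 30-day Simple Moving Average. By default a value is produced only
# once a full 30-observation window is available.
sma = prices.rolling(window=30).mean()

print(sma.isna().sum())   # number of leading undefined values
print(sma.iloc[29])       # first valid value: mean of 1..30
```

The first 29 entries are NaN and the first valid average appears on day 30. Passing `min_periods` to `rolling` would relax this, at the cost of averaging over shorter windows at the start.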
How many trading days are there in a typical year for NYSE?
Yep! The NYSE and NASDAQ average about 252 trading days a year. The estimate comes from 365.25 (average days per year) × 5/7 (proportion of weekdays) ≈ 260.89 weekdays, minus 9 (holidays) = 251.89 ≈ 252.
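The same arithmetic as a quick check in Python. The holiday count of 9 is the figure used in the text; the actual number varies slightly from year to year.

```python
# Rough estimate of trading days per year, following the arithmetic above.
days_per_year = 365.25   # average, accounting for leap years
weekday_share = 5 / 7    # Monday to Friday
holidays = 9             # market holidays assumed in the text

weekdays = days_per_year * weekday_share   # about 260.89
trading_days = weekdays - holidays         # about 251.89

print(round(trading_days))
```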
Experiment A: Randomly select a smattering of 100 stocks that are trading today, simulate buying them in 2005, or whenever they went public, investing equally in each, and hold on to them till the present day. Don’t try to apply any strategy, just pick stocks randomly!
Experiment B: Randomly select another collection of 100 stocks, but this time, from those that were trading in 2005. Again, simulate buying them in 2005, investing uniformly, and hold on to them.
Repeat these experiments multiple times and calculate the total return on your investment in each case. Now, would you expect the mean return for A to be significantly higher or lower than that of B? See if you can spot a clear difference.
Would you expect the mean return for A to be significantly higher or lower than that of B?
A: Mean return from A would be higher.
B: Mean return from B would be higher.
Explanation: You're right! The average return from Experiment A would indeed be higher than that from B. This is due to a phenomenon known as Survivor Bias (also called survivorship bias), which is the subject of the next video!
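The effect can be reproduced with a toy simulation. This is only a sketch on synthetic data: the lognormal returns, the universe of 2,000 stocks, and the assumption that the worst-performing 30% are no longer trading today are all invented for illustration, and stocks that listed after 2005 are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical universe: 2,000 stocks trading in 2005, each with a
# synthetic total return from 2005 to today (e.g. 0.5 means +50%).
n_stocks = 2000
total_return = rng.lognormal(mean=0.5, sigma=1.0, size=n_stocks) - 1.0

# Assumption: the worst-performing 30% were delisted along the way,
# so they are missing from the list of stocks trading today.
cutoff = np.quantile(total_return, 0.30)
survived = total_return > cutoff


def mean_portfolio_return(candidates, n_trials=1000, n_picks=100):
    """Mean return over many equally weighted random 100-stock portfolios."""
    trials = [
        rng.choice(candidates, size=n_picks, replace=False).mean()
        for _ in range(n_trials)
    ]
    return float(np.mean(trials))


# Experiment A: pick only from stocks still trading today (survivors).
mean_a = mean_portfolio_return(total_return[survived])
# Experiment B: pick from everything that was trading in 2005.
mean_b = mean_portfolio_return(total_return)

print(f"Experiment A mean return: {mean_a:.2f}")
print(f"Experiment B mean return: {mean_b:.2f}")
```

Because Experiment A can never pick a stock that failed, its mean return comes out higher than that of Experiment B, even though no stock-picking skill is involved.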