*The following notes were adapted from a talk given at a machine learning discussion group this semester at Columbia. They have not been edited, nor have they been checked for completeness/accuracy. I do hope, however, that someone finds their whirlwind style an entertaining introduction to the applications of RNNs in quant finance.*

Many current curricula focus either on the mathematics-heavy side of neural networks, beginning with an introduction to backpropagation for feedforward networks, or on constructing huge black-box multi-layered “deep” neural network models. In contrast, we seek to introduce and motivate the usefulness of a specific subclass of networks, Long Short-Term Memory networks, or LSTMs, as applied to time series prediction problems in quantitative finance.

You should already know the answer to this if you have taken any basic quantitative finance course or done any personal experimentation. Practically all financial data takes the form of time series, because prices and economic conditions depend on time. Compared with time series from other domains of science, engineering, and mathematics, financial time series (price data for equities, foreign exchange, futures markets in areas such as commodities, and the like) are particularly difficult to forecast. What makes forecasting TS for financial data so hard? Strong trends have high reversal rates; there is a large amount of noisy, stochastic movement, usually modeled with Brownian motion and the like; and there is little seasonality or other regular structure on which to base assumptions about movements.

Past work in statistics for time series prediction ranges from models as simple as regression up through similarly motivated but more complex models such as autoregressive integrated moving average (ARIMA) models and extensions thereof. In industry, very basic time series models are relied upon heavily for signal generation, usually constructed from simple and exponential moving averages fed into relatively arbitrary normalization terms and constraint sets. The shortcomings of these approaches are obvious: they rely only on data from the price series and on basic statistical tools. Though reasonable results can be obtained for other types of TS data in academia, very little success is seen with respect to financial time series forecasting.
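To make the "moving averages fed into a signal" idea concrete, here is a minimal sketch of a crossover signal in plain Python. The prices, the window lengths, the function names, and the +1/-1/0 convention are all assumptions chosen for illustration; this is not a description of any particular desk's model.

```python
def sma(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def crossover_signal(values, fast=3, slow=5):
    """+1 when the fast EMA is above the slow SMA, -1 when below, else 0."""
    signal = []
    for f, s in zip(ema(values, fast), sma(values, slow)):
        if s is None or f == s:
            signal.append(0)  # not enough history yet, or no view
        else:
            signal.append(1 if f > s else -1)
    return signal


# Made-up prices, for illustration only.
prices = [100, 101, 102, 104, 103, 101, 99, 98, 100, 103]
signals = crossover_signal(prices)
```

Note that everything here is computed from the price series alone, which is exactly the limitation described above.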

An LSTM is a variety of Recurrent Neural Network (RNN), which is itself a flavor of ANNs, the general class of artificial neural networks.

Very briefly, for those of you unfamiliar, an (artificial) neural network is a set of interconnected neurons, each firing (propagating) in accordance with a weighted sum of its inputs passed through an activation function. These models are biologically inspired, and have been used with great success in everything from computer vision to natural language processing.
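As a rough sketch of the "weighted sum passed through an activation function" description, a single artificial neuron might look like the following. The sigmoid activation and the example weights are assumptions picked for illustration.

```python
import math


def sigmoid(x):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values only: three inputs with hand-picked weights.
out = neuron([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=0.0)
```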

Feedforward neural networks are a specific topology in which inputs are passed through interconnected *layers* of artificial neurons. The layers between the input and output neurons are the hidden layers. Each hidden layer takes its inputs directly from the layer above it and passes its outputs directly to the layer below. With a bit of math or a more formal notation, we can see that a feedforward neural net can be expressed as a sequence of matrix multiplications applied to the input vector, each followed by an activation function; stated otherwise, training converges toward approximating a function that models the relation between inputs and outputs. Notably, this function can be non-linear, leading to many use cases in classification etc. Also notable, there is no state dependency unless prior states (of a TS, for example) are fed in. This makes feedforward neural networks not particularly performant for TS prediction problems, especially in the case of chaotic price data.
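The sketch below shows a two-layer forward pass written out by hand, with prior values of the series fed in explicitly as a lookback window. The layer sizes, the random untrained weights, and the made-up return series are assumptions for illustration; nothing here is trained.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), computed row by row."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def identity(x):
    return x


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(window, params):
    """Hidden tanh layer followed by a single linear output neuron."""
    (w1, b1), (w2, b2) = params
    hidden = dense(window, w1, b1, math.tanh)
    return dense(hidden, w2, b2, identity)[0]


rng = random.Random(0)
lookback = 4  # hypothetical number of lagged values fed to the network
params = [init_layer(8, lookback, rng), init_layer(1, 8, rng)]

returns = [0.01, -0.02, 0.005, 0.0, 0.015, -0.01]  # made-up returns
pred_a = forward(returns[0:4], params)
pred_b = forward(returns[1:5], params)
```

The output depends only on the window passed in: calling `forward` twice with the same window gives the same number, regardless of what was seen before.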

Recurrent neural networks are networks with loops. These loops (recurrences) give rise to persistent information, or, in other words, create a stateful network. The behavior of the network depends not only on the current set of inputs, but also on all prior sets of inputs (to some reasonable limit, discussed later) which have given rise to the current state of the network.
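A minimal sketch of that statefulness, assuming the common "vanilla" recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), is shown below. The weights are hand-picked rather than trained, and the two input sequences are invented for illustration.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for i in range(len(b_h)):
        z = b_h[i]
        z += sum(w * xi for w, xi in zip(w_xh[i], x))
        z += sum(w * hj for w, hj in zip(w_hh[i], h_prev))
        h.append(math.tanh(z))
    return h


def run(sequence, w_xh, w_hh, b_h):
    """Feed a whole sequence through the cell, carrying the state along."""
    h = [0.0] * len(b_h)
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
    return h


# Hand-picked, untrained weights: one input feature, two hidden units.
w_xh = [[0.5], [-0.3]]
w_hh = [[0.1, 0.2], [-0.2, 0.4]]
b_h = [0.0, 0.0]

# Two sequences that end with the same input but have different histories.
h_a = run([[1.0], [0.0], [0.5]], w_xh, w_hh, b_h)
h_b = run([[-1.0], [0.0], [0.5]], w_xh, w_hh, b_h)
```

Although both sequences end on the same input, `h_a` and `h_b` differ, because the hidden state carries information from earlier steps.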

RNNs are universal in the sense of being Turing complete: given enough units and a suitably trained weight matrix, they can in principle compute anything a computer could compute. In this sense, one can view the training of RNNs as the training of “universal program approximators” as opposed to the “universal function approximators” feedforward neural networks are usually viewed as.

A nicely illustrated tutorial which we will go through:

LSTMs are particularly well suited to time series prediction because they can “learn” and “remember” things like market regimes in long-term memory, while their short-term memory and good interaction with lookback windows (and even time-irregular data or large steps between significant events) lead to solid performance in short-term trend prediction.
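To give a feel for where the "remembering" comes from, here is a sketch of one step of a single-unit LSTM cell using the standard gate equations (forget, input, and output gates plus a candidate value). The scalar sizes and the hand-set parameters are assumptions for illustration; a real model learns these weights and uses vectors and matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell with a scalar input."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # long-term cell state
    h = o * math.tanh(c)     # short-term hidden state / output
    return h, c


# Hand-set parameters (not trained): the forget gate is pinned near 1 and
# the input gate near 0, so the cell holds on to whatever it already stores.
remember = dict(wf=0.0, uf=0.0, bf=10.0,
                wi=0.0, ui=0.0, bi=-10.0,
                wo=0.0, uo=0.0, bo=0.0,
                wg=0.0, ug=0.0, bg=0.0)

h, c = 0.0, 0.7  # pretend 0.7 encodes something like a market regime
for x in [0.3, -1.2, 0.8, 0.1] * 10:  # 40 steps of made-up inputs
    h, c = lstm_step(x, h, c, remember)
```

After 40 steps the cell state is still very close to 0.7. In a trained network the gates are functions of the data, so the cell can decide when to keep, overwrite, or expose what it has stored.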

Read about:

- backpropagation through time (BPTT)/gradient descent
- data normalization for neural networks
- activation functions for FFNNs
- activation functions and gates for LSTMs
- completeness proofs and other analysis of RNNs

It sounds like LSTMs are great! How can I use them in my next job or financial modeling/data science project? In short: Python. Even shorter (with respect to development time): TensorFlow/Theano. Even shorter? Keras. Writing the code to perform efficient time-based gradient descent is difficult, not to mention implementing different topologies, preprocessing, activation functions, and so on. Please, for your own sake, use a Python library. A few brief comments with respect to these libraries:

- Theano is fast, and optimized for use in general deep learning research. It supports nearly all model types, including models that have only recently emerged in the literature. Its Python bindings are relatively easy to use, though they are a bit advanced for early-stage users.
- TensorFlow is slower (in some cases), but has the best model support, and (in my opinion) community. It has good documentation and a good Python API.
- Keras, which is just a wrapper on top of both of the above frameworks, is the best choice in almost all cases. Fantastic documentation supplements an easy-to-use Python API, which allows for the easy specification of models that can run on many backends and computing systems. It plays well with NumPy, pandas, etc. Really, this is the best way to make your life easier.
- Other C++-bound DL libraries are sometimes faster, as there is overhead in Python function calls etc. Usually, besides model definition and kernel compilation (and memory copy), the host CPU isn’t doing much, but keep this in mind for production or if you don’t have access to a GPU.

LSTMs, though not on the scale of so-called “deep” (some, convolutional) neural networks, are difficult to train, mainly because of the computationally expensive nature of time-based gradient descent, the size of the networks, and the amount of data over which they must be trained. To complicate matters further, since LSTMs are stateful, many problems require “online” training, meaning that they cannot be trained all at once by highly optimized, vectorized calculations in a batch with error computation and gradient descent over groups. To this end, I highly recommend running training on a personal server with a GPU, which greatly accelerates tensor math through parallelism. cuDNN and other work by NVIDIA, along with the libraries mentioned above, have led to large gains in training speed for these networks.

Other (more academic) problems include convergence time, throwing away noise (or reducing its effect on convergence time), finding correlated or predictive data to feed through these models, data normalization/regularization/processing, decision making for trading based upon output, and much more.
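As one small example of the data processing involved, the sketch below z-scores a series using statistics from the training portion only (so no information leaks in from later data) and then slices it into lookback windows paired with next-step targets. The split point, the window length, and the made-up returns are assumptions for illustration.

```python
import math


def fit_zscore(train):
    """Mean and standard deviation estimated on the training data only."""
    mean = sum(train) / len(train)
    var = sum((v - mean) ** 2 for v in train) / len(train)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant series
    return mean, std


def apply_zscore(values, mean, std):
    return [(v - mean) / std for v in values]


def make_windows(series, lookback):
    """Pair each run of `lookback` values with the value that follows it."""
    xs, ys = [], []
    for end in range(lookback, len(series)):
        xs.append(series[end - lookback:end])
        ys.append(series[end])
    return xs, ys


# Made-up daily returns, for illustration only.
returns = [0.01, -0.02, 0.005, 0.0, 0.015, -0.01, 0.02, -0.005, 0.01]
split = 6  # hypothetical train/test boundary

mean, std = fit_zscore(returns[:split])
train = apply_zscore(returns[:split], mean, std)
test = apply_zscore(returns[split:], mean, std)  # reuses training statistics

x_train, y_train = make_windows(train, lookback=3)
```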

Luke will forward some papers. Most have to do with one of the following:

- topology or parameter selection
- optimizing lookback windows and learning “elasticity” while maintaining statefulness for good short-term predictions in a sliding window
- regarding LSTMs at large: optimizing training
- optimizing and pruning large networks
- optimizing and pruning training data (concept of “attention”)
- generative models and other fancy stuff or fun applications

Some code:

— Lucas Schuermann