Quantitative Python
Risk management with Python
Rafael Ballestas

Comment 
philosophy 
20190409 
Now that we have an understanding of risk concepts such as the loss exceedance curve, valueatrisk, Bayes Rule, and fitting distributions, we would like to have a realiable, extensible and preferably open tool to perform these computations. In the background, we have used a spreadsheet, which is hard to extend. We have used GNU Octave, which is good, but not a proper programming language. Our favorite language at Fluid Attacks, Python, has modules for statistics, scientific computation and even finance itself. Let’s take it for a spin around this risky neighborhood.
Python has a whole ecosystem for numerical computing (v.g. Numpy) and data analysis (Pandas) and is well on its way to becoming a standard in Open Science. Being a free and open source tool, there are also many derived projects which make life easier when coding, such as the Jupyter Notebook, which allows us to selectively run code snippets, much like in commercial packages such as Matlab and Mathematica. This enables and encourages, at least, initial exploration, although it might not be the best fit for developing more involved code.
Let us see how we can automate the generation of a loss exceedance curve via Monte Carlo simulation. Here we will closely follow our article on the subject, so as not to duplicate information. In that article, we wanted to find a distribution for losses based on expert estimations of occurrence likelihood and confidence intervals for the impact:
So we need to read those values in our script. Since this is tabular information of the kind that would be useful to view, say, in a spreadsheet, it will be convenient to read this into a Pandas dataframe:
import pandas as pd
events = pd.read_csv('events.csv')
events.head()
We declare an event happens if a number taken at random is beyond a certain threshold, given by the second column in the table above:
def event_happens(occurrence_probability):
return np.random.rand() < occurrence_probability
If and when the event happens, we next need to know the extent of the loss due to this single event. Recall that we modeled this with a lognormal variable, whose parameters we got from the estimated confidence interval:
def lognormal_event_result(lower, upper):
mean = (np.log(upper) + np.log(lower))/2.0
stdv = (np.log(upper)  np.log(lower))/3.29
return np.random.lognormal(mean, stdv)
All of the events in the above table can happen in a single year, so to simulate a scenario, we need to find out, for each of them, if they happen, and how much money they will cost us. Finally, we add all the losses and return that single number as a summary of the losses in a simulated year:
def simulate_scenario():
total_loss = 0
for _, event in events.iterrows():
if event_happens(event['Probability']):
total_loss += lognormal_event_result(event['Lower'], event['Upper'])
return total_loss
Now, the crucial step in MonteCarlo simulation is to simulate many scenarios and record those results. Let us write a function that does just this, returning the results in a basic Python list, which we could later turn, if we so wished, into a Pandas or Numpynative structure for statistical analysis. The function takes as input the number of times we want to simulate scenarios:
def monte_carlo(rounds):
list_losses = []
for i in range(rounds):
loss_result = simulate_scenario()
list_losses.append(loss_result)
return list_losses
Going graphic
Just to get a feeling for the results, let us run a thousand scenarios and plot them, that is, the result of each simulated year, in the order in which they were obtained. As foretold, we could convert the results into a Pandas DataSeries, if anything, to illustrate how they work. We also need to import Matplotlib for visualization:
import matplotlib.pyplot as plt
results = monte_carlo(1000)
results_series = pd.Series(results)
results_series.plot()
It can be observed that the vast majority of them are in the fringe between 0 and 15 million. But it is not infeasible to have results that are way beyond the central interval. In order to rule out what’s simple chance and what is really happening due to the distribution of loss, we can simply run more scenarios. Tens or hundreds of thousands of scenarios is a good rule of thumb, without sacrificing performance. A thousand runs takes around 5 seconds, and 10000 takes around 50. At some point adding more simulations does not necessarily improve the quality of results. Your mileage may vary.
No matter the number of scenarios, the results are not as useful as they could be until we aggregate them, v.g., in a histogram. Pandas also provides a shorthand for this:
results_series.hist(bins = 15)
We’re getting closer to the loss exceedance curve, but not there yet. Recall that the loss exceedance curve gives, for every value in dollars, the probability that the loss will be greater than or equal to that value. Since we don’t have, and can’t possibly have, a real probability distribution, i.e., a mathematical function, but only an approximation to one in the form of a table, we estimate probabilities simply by counting occurrences. Hence the "probability", or estimation thereof, of a single value, is the number of times that value appeared in the simulation. So let us take evenly spaced values, and count the number of times each of those values is exceeded (or matched). The numpy function cumsum does just that, except in the opposite order: it adds the values seen up to a moment. So if we take the intervals and the counts separately, revert the counts list and then do cumsum on it, we get what we need, in reverse order. To fix that we simply revert again:
import numpy as np
result_nparray = np.array(results_list)
hist, edges = np.histogram(results_nparray, bins = 20)
cumrev = np.cumsum(hist[::1])[::1]*100/len(results_nparray)
plt.plot(edges[:1], cumrev)
And voilĂˇ, we get our loss exceedance curve as we sought:
We can repeat the procedure with a more moderate dataset to obtain the inherent risk LEC, in the sense that the probabilities and impact `CI`s are lower. And finally, to obtain the risk tolerance curve, we give a few points obtained from interviewing someone in charge, as described in the original article and fitting a curve to it using SciPy’s Inteporlation functions:
from scipy import interpolate
xs = np.array([1,2,3,7,9])*(1e6)
tols = np.array([100,60,10,2,1])
xint = np.linspace(min(xs), max(xs))
fit = interpolate.interp1d(xs, tols, kind='slinear')
plt.plot(xint, fit(xint))
All together in a single plot:
Risk measures
Now obtaining the 5% value at risk is simply a matter of asking for the 95% percentile of the "distribution", i.e., the actual simulation results, in its Numpyarray incarnation:
>>> np.percentile(arr, 95)
23360441.53826834
Hence the VaR, according to this particular simulation is a little over $23 million. Just as simple as it is to obtain the tail value at risk. If we had a mathematical function for the distribution we would have to compute an integral in order to obtain it, but since what we have is a discrete approximation to it, i.e., a simple table of values, we can just average the values that are under the VaR:
>>> np.average(arr[arr >= var])
31949559.99328234
Thus in case of a VaR breach, we can expect the loss to be of little less than $32 million.
Let us simulate the input values for the simulation, as if we were running the simulation every day with different occurrence probabilities and impacts. That is, let us randomize the table in figure 1 above. We can skip the importing and the column with the events names, since we don’t use them. We simply generate a column of random numbers between appropriate boundaries: 0 and 0.1 for the probabilities, 0 and a million for the lower CI bounds, a million and a billion for the upper CI bounds. Finally we glue them together in a Pandas DataFrame so they are usable to our previous functions:
def gen_random_events():
probability_column = np.random.random_sample(30)*0.1
lower_ci_column = np.random.random_sample(30)*(1e6)
upper_ci_column = np.random.random_sample(30)*(9e6)+1e6
dicc = {'Probability' : probability_column,
'Lower' : lower_ci_column,
'Upper': upper_ci_column}
events_rand = pd.DataFrame(dicc)
return events_rand
Next we run MonteCarlo on such a simulated DataFrame of events, once for each day of a fictitious month, compute the VaR and tVaR for each day and observe how they evolve:
Since this is a madeup example and the probabilities are sampled simply, i.e., from a uniform distribution, the results are, well, uniform. However for the sake of conclusion, we can imagine there is a steady, if slow, trend towards raising the VaR. It is interesting that the highest peak in tVaR corresponds to a VaR that is not that different from its neighbors. This goes to show that one is not just a simple function of the other, which is often the case in dealing with uncertainty.
Let this article serve merely as a taste of what can be done using data analysis tools in the numerical Python ecosystem. There are plenty of modules specifically for finance and Monte Carlo simulations. Imagine what those could do for your organization in regards to risk management. Stay tuned for more of riskapplied Python.
References

C. DavidsonPilon (2019). Probabilistic Programming and Bayesian Methods for Hackers.

C. Motiff (2019). Monte Carlo Simulation with Python

B. Mikulski (2018). Monte Carlo simulation in Python
Appendix: Full script
Download Python script or as Jupyter notebook and input data for inherent risk and residual risk.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145  #!/usr/bin/env python
# coding: utf8
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
def event_happens(occurrence_probability):
""" An event happens if a randomly chosen number
falls bellow the given occurrence probability """
return np.random.rand() < occurrence_probability
def lognormal_event_result(lower, upper):
""" Draws a number from the lognormal distribution
with given lower and upper bounds for confidence interval """'
mean = (np.log(upper) + np.log(lower))/2.0
stdv = (np.log(upper)  np.log(lower))/3.29
return np.random.lognormal(mean,stdv)
def simulate_scenario(events):
""" If an event from the input list happens, add the losses due to it """
total_loss = 0
for _, event in events.iterrows():
if event_happens(event['Probability']):
total_loss += lognormal_event_result(event['Lower'],event['Upper'])
return total_loss
# Read in the basic events table and test scenario simulation
events_basic = pd.read_csv('events.csv')
simulate_scenario(events_basic)
def monte_carlo(events, rounds):
""" Simulate many scenarios, returns the results as simple List """
list_losses = []
for i in range(rounds):
loss_result = simulate_scenario(events)
list_losses.append(loss_result)
return list_losses
# To test Monte Carlo execution time for some number of iterations
start_time = time.time()
monte_carlo(events_basic, 1000)
print(time.time()  start_time)
# This time run for real, save the results as pandas Series and as numpy parray
results = monte_carlo(events_basic, 1000)
results_series = pd.Series(results)
results_nparray = np.array(results)
results_series.describe()
# Plot the results in the order they came out
results_series.plot()
plt.ticklabel_format(axis='y', style='sci', scilimits=(6,6))
plt.xlabel('Iteration')
plt.ylabel('Loss (millions)')
plt.title('MonteCarlo simulation results')
plt.savefig('resultsraw.png')
# Aggregate and plot them as a histogram
results_series.hist(bins=15)
plt.xlabel('Loss (millions)')
plt.ylabel('Frequency (count)')
plt.ticklabel_format(axis='x',style='sci',scilimits=(6,6))
plt.title('Aggregated simulation results')
plt.savefig('resultshist.png')
def plot_lec(results_nparray, label):
""" Plots the loss exceedance curve from
an nparray of Monte Carlo results """
hist, edges = np.histogram(results_nparray, bins=40)
cumrev = np.cumsum(hist[::1]*100/len(results_nparray))[::1]
plt.plot(edges[:1], cumrev, label=label)
plt.xlabel('Loss (millions)')
plt.ylabel('Chance of loss or greater (%)')
plt.ticklabel_format(axis='x',style='sci',scilimits=(6,6))
plt.title('Loss Exceedance Curve')
plt.grid()
#plt.xscale('log') # opt make x axis logarithmic
# Obtain the residual risk curve
events_redux = pd.read_csv('events_redux.csv')
results_redux = monte_carlo(events_redux, 100)
arr_redux = np.array(results_redux)
plot_lec(arr_redux, 'Residual risk')
plot_lec(results_nparray, 'Inherent risk')
plt.xscale('log')
plt.xlabel('Loss')
plt.grid()
# Interpolate the risk tolerance curve
xs = np.array([1,2,3,7,9])*(1e6)
tols = np.array([100,60,10,2,1])
plt.plot(xs, tols, 'o')
xint = np.linspace(min(xs), max(xs))
yint = interpolate.interp1d(xs, tols, kind='slinear')
plt.plot(xint, yint(xint))
plt.xscale('log')
plt.legend()
def get_vars(array):
""" Computes the 5% VaR and tVar from an nparray of Monte Carlo results """
var = np.percentile(array, 95)
tvar = np.average(array[array >= var])
return var, tvar
def gen_random_events():
""" Simulates read input data """
probability_column = np.random.random_sample(30)*0.1
lower_ci_column = np.random.random_sample(30)*(1e6)
upper_ci_column = np.random.random_sample(30)*(9e6)+1e6
dicc = {'Probability' : probability_column,
'Lower' : lower_ci_column,
'Upper': upper_ci_column}
events_rand = pd.DataFrame(dicc)
return events_rand
def simulate_daily_vars(num_days):
""" Runs MonteCarlo over a number of days with simulated inputs """
vars, tvars = [], []
for i in range(num_days):
events = gen_random_events()
results = monte_carlo(events, 100)
results_nparray = np.array(results)
var, tvar = get_vars(results_nparray)
vars.append(var)
tvars.append(tvar)
return vars, tvars
# Simulate t/Var monitoring
days = 30
vars, tvars = simulate_daily_vars(days)
t = np.arange(1, days + 1)
plt.plot(t, vars, label='VaR')
plt.plot(t, tvars, label='tVaR')
plt.title('Evolution of daily VaR in month')
plt.ticklabel_format(axis='y', style='sci', scilimits=(6,6))
plt.ylabel('t/VaR (millions)')
plt.xlabel('Day of the month')
plt.legend()
plt.savefig('monitorvartime.png')

Mathematician 
with an itch for CS 