Enzyme Data Analysis

Image from https://depts.washington.edu/wmatkins/kinetics/michaelis-menten.html

About

The main focus of this project was to find the model parameters that best fit a data set of reaction velocities and substrate concentrations. In an ideal situation, the data would be modeled perfectly by the Michaelis-Menten curve, shown as the red curve in the image. The equation for this curve is shown below the graph. V0 is the initial reaction velocity v. How we obtain V0 from the data is discussed later. V_max is the maximum reaction velocity, that is, the asymptote that the red curve approaches. K_m is the substrate concentration at which the reaction velocity is half of V_max. S is the substrate concentration.
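For reference, the standard form of the Michaelis-Menten equation, written with the symbols defined above (using [S] for the substrate concentration S), is:

```latex
v_0 = \frac{V_{\max}\,[S]}{K_m + [S]}
```

Setting [S] = K_m gives v_0 = V_max / 2, which matches the definition of K_m as the substrate concentration at half of V_max.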

In this project, we were given 5 enzymes to analyze: A, B, C, D and E. Each enzyme had a test 1 and a duplicate of test 1, so we had 10 data sets in total. Our first step was to obtain the V0 values from each data set. Each data set covered 10 separate initial substrate concentrations, which gave us 10 points on the graph, i.e. (S1, V1), (S2, V2)... (S10, V10). The main challenge was to determine the V_max and K_m parameters of the Michaelis-Menten equation that best fit these 10 points. We combined two fundamental steps to solve this nonlinear regression problem. The first step was an algebraic approach, and the second step mixed and matched the resulting V_max and K_m candidates. Finally, we applied gradient descent to further improve the V_max and K_m values from the previous steps. Our ultimate goal was to be as accurate as established methods such as the Eadie–Hofstee diagram, the Hanes–Woolf plot and the Lineweaver–Burk plot.

The Excel data and the algorithm, written in MATLAB, can be found here.

Obtaining the reaction velocities V0

To simplify the explanation, we will focus on test 1 of Enzyme A (the same process was performed on the other enzymes and tests).

To obtain the 10 points (S1, V1), (S2, V2)... (S10, V10) for Enzyme A test 1, we have to determine the reaction velocities v0 from the data. At this stage, only the substrate concentrations S1, S2, ..., S10 are known: 3.75 uM, 7.5 uM, ..., 2000 uM. The reaction velocity v0 for each substrate concentration is the initial rate of change of the plot of [P] (uM) vs Time (s). To find v0 for the substrate concentration 3.75 uM, a plot is shown of [P] (uM) in column B vs Time (s) in column A. Keep in mind that this plot is not the same as the Michaelis-Menten curve shown in red above.

The reaction velocity v0 can be represented by the slope of the tangent line at the initial interval of the [P] (uM) vs Time (s) plot. Based on our algorithm, we determined that it was most accurate to find the slope of the line between the initial point and the point at the 5% interval.

The plot of [P] (uM) in column B vs Time (s) in column A has data recorded up to 1480 seconds. The first 5% interval therefore runs from 0 seconds to 74 seconds. The value of [P] at 74 seconds is 1.492 uM, and the initial point is (0 seconds, 0 uM). The slope of the tangent line is then approximated as (1.492 - 0) / (74 - 0), which yields a reaction velocity v0 of approximately 0.02016 uM / s. We have found our first point (S1, V1) as (3.75 uM, 0.02016 uM / s). The same process is used to find the rest of the points (S2, V2), (S3, V3)...(S10, V10).
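The slope calculation above can be sketched in Python as follows. This is a minimal illustration, not the MATLAB code from the project: the function name `initial_velocity` and the sample trace are hypothetical. Only the points (0 s, 0 uM) and (74 s, 1.492 uM) and the 1480 s end time come from the text; the other values are placeholders.

```python
def initial_velocity(times, product, fraction=0.05):
    """Slope of the line from the first sample to the sample at `fraction` of the run."""
    t_cut = times[0] + fraction * (times[-1] - times[0])
    # Last sample at or before the cutoff; the small tolerance absorbs float rounding.
    idx = max(i for i, t in enumerate(times) if t <= t_cut + 1e-9)
    if idx == 0:
        raise ValueError("no sample falls inside the initial window")
    return (product[idx] - product[0]) / (times[idx] - times[0])

# Hypothetical trace for the 3.75 uM run (placeholder values after 74 s).
times = [0.0, 74.0, 740.0, 1480.0]
product = [0.0, 1.492, 3.2, 3.6]

v0 = initial_velocity(times, product)
print(f"v0 = {v0:.5f} uM/s")
```

With real data, `times` and `product` would be columns A and B of the spreadsheet for one substrate concentration.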

Because we were given a test 1 and a test 1 duplicate for each enzyme, we ended up with two computed V0 values for each substrate concentration. We decided to average the two V0 values before moving forward with our analysis. For example, while Enzyme A test 1 yielded a first point of (3.75 uM, 0.02016 uM / s), the Enzyme A test 1 duplicate yielded a first point of (3.75 uM, 0.02684 uM / s). We take the average and use the point (3.75 uM, 0.0235 uM / s) in our analysis.

The rest of the points for Enzyme A were determined by the same process and are as follows:
(3.75 uM, 0.0235 uM / s), (7.5 uM, 0.0437 uM / s), (15 uM, 0.0788 uM / s), (30 uM, 0.1552 uM / s), (65 uM, 0.2633 uM / s), (125 uM, 0.3997 uM / s), (250 uM, 0.5553 uM / s), (500 uM, 0.6730 uM / s), (1000 uM, 0.7908 uM / s), (2000 uM, 0.8794 uM / s).

Nonlinear Regression

Plotting the points determined in the section 'Obtaining the reaction velocities V0' gives a set of points that appear to follow a Michaelis-Menten curve similar to the red curve at the top of this page.

The Michaelis-Menten equation has two unknown parameters: V_max and K_m. The animation from Desmos shows the curve, where the variable y represents V0 (reaction velocity v), x represents the substrate concentration [S], v represents V_max and k represents K_m. Our goal is to find the V_max and K_m that best model our data points. In an ideal world, there would be values of V_max and K_m that yield a curve passing through every single data point.

Step 1: The Algebraic Approach

The main idea involves solving a system of equations. We are solving for two unknown variables, V_max and K_m. Because we need two equations to solve for two unknowns, we can pick any two of our 10 previously determined points to form two equations. The working is shown with the final equations highlighted.
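One way to write out this working, assuming the two chosen points are labeled (S_1, V_1) and (S_2, V_2), is sketched below. This is a reconstruction from the Michaelis-Menten equation, so the exact form of the highlighted equations in the original figure may differ.

```latex
\begin{aligned}
V_1 &= \frac{V_{\max} S_1}{K_m + S_1}, \qquad V_2 = \frac{V_{\max} S_2}{K_m + S_2} \\
V_{\max} &= \frac{V_1 (K_m + S_1)}{S_1} = \frac{V_2 (K_m + S_2)}{S_2} \\
V_1 S_2 (K_m + S_1) &= V_2 S_1 (K_m + S_2) \\
K_m &= \frac{S_1 S_2 (V_2 - V_1)}{V_1 S_2 - V_2 S_1} \\
V_{\max} &= \frac{V_1 (K_m + S_1)}{S_1}
\end{aligned}
```

The expression for K_m contains only measured quantities, so it can be evaluated first and then substituted into the expression for V_max.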

As an example, take the points (S7, V7) and (S10, V10), with values (250 uM, 0.5553 uM / s) and (2000 uM, 0.8794 uM / s) respectively. Based on the highlighted final equations, we calculate K_m first, since that equation contains no unknown values. It does not matter which point is labeled (S1, V1) and which is labeled (S2, V2), as long as the labeling is kept consistent afterwards. The K_m computed from these two points is 181.925 uM and the V_max is 0.9594 uM / s. This method can be repeated for other combinations of points, e.g. (S7, V7), (S8, V8) or (S7, V7), (S9, V9)... etc.
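The two-point calculation can be checked with a short Python sketch. The function name `solve_two_points` is hypothetical, and the closed-form expressions are derived directly from the Michaelis-Menten equation rather than copied from the project's MATLAB code.

```python
def solve_two_points(s1, v1, s2, v2):
    """Return (K_m, V_max) for the Michaelis-Menten curve through two points."""
    denom = v1 * s2 - v2 * s1
    if denom == 0:
        raise ValueError("the two points do not determine a unique curve")
    # K_m depends only on the measured values, so it is computed first.
    k_m = s1 * s2 * (v2 - v1) / denom
    # Substitute K_m back into the equation for the first point.
    v_max = v1 * (k_m + s1) / s1
    return k_m, v_max

# Points (S7, V7) and (S10, V10) for Enzyme A.
k_m, v_max = solve_two_points(250.0, 0.5553, 2000.0, 0.8794)
print(f"K_m = {k_m:.3f} uM, V_max = {v_max:.4f} uM/s")
```

Swapping the two points in the call gives the same result, which is why the labeling only has to be consistent.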

With this method, the challenge was to determine which two of our 10 points would be the best ones to select to represent the model. Trying all possible combinations gives 0.5*n*(n-1) pairs (where n is the total number of points). With n = 10, we therefore have 45 different combinations to try.

The first iteration of this idea was to find the best pair of coordinates out of the 45 combinations which yielded V_max and K_m values that resulted in the least SSE (Sum of Squared Errors). While this method was fairly accurate, it lacked precision (especially when we ran the algorithm on all 5 enzymes). This led us to reconsider our approach and ultimately led to Step 2: mixing and matching.
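A minimal Python sketch of this first iteration, using the averaged Enzyme A points listed earlier. The helper names `sse` and `solve_two_points` are hypothetical and this is not the project's MATLAB code. It assumes no two points share the same V/S ratio, which holds for this data set, so the two-point formula never divides by zero.

```python
from itertools import combinations

# Averaged Enzyme A points (S in uM, V0 in uM/s).
S = [3.75, 7.5, 15, 30, 65, 125, 250, 500, 1000, 2000]
V = [0.0235, 0.0437, 0.0788, 0.1552, 0.2633, 0.3997, 0.5553, 0.6730, 0.7908, 0.8794]

def sse(k_m, v_max):
    """Sum of squared differences between predicted and measured velocities."""
    return sum((v_max * s / (k_m + s) - v) ** 2 for s, v in zip(S, V))

def solve_two_points(p1, p2):
    """Return (K_m, V_max) for the curve passing through both points."""
    (s1, v1), (s2, v2) = p1, p2
    k_m = s1 * s2 * (v2 - v1) / (v1 * s2 - v2 * s1)
    v_max = v1 * (k_m + s1) / s1
    return k_m, v_max

points = list(zip(S, V))
# One (K_m, V_max) candidate per unordered pair of points: 10 * 9 / 2 = 45.
candidates = [solve_two_points(p, q) for p, q in combinations(points, 2)]
best_k_m, best_v_max = min(candidates, key=lambda c: sse(*c))
print(len(candidates), best_k_m, best_v_max, sse(best_k_m, best_v_max))
```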

Step 2: Mix and Match / Minimizing SSE

Rather than test just the 45 matched pairs of K_m and V_max, we finalized our algorithm by testing every possible combination of the 45 K_m values with the 45 V_max values. The candidates evaluated to minimize SSE would therefore include:

(K_m1, V_max1), (K_m1, V_max2), (K_m1, V_max3)...(K_m1, V_max45)
(K_m2, V_max1), (K_m2, V_max2), (K_m2, V_max3)...(K_m2, V_max45)
(K_m3, V_max1), (K_m3, V_max2), (K_m3, V_max3)...(K_m3, V_max45)
. . . . . . .
(K_m45, V_max1), (K_m45, V_max2), (K_m45, V_max3)...(K_m45, V_max45)

We end up with 45 * 45 = 2025 possible combinations of K_m and V_max for each of the 5 enzymes. For each of the 2025 pairs, the K_m and V_max are passed to a function that evaluates the SSE. The SSE is the sum of the squared differences between the velocities predicted by the given K_m and V_max at each substrate concentration and the measured velocities from the 10 points determined at the beginning.
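The mix-and-match search can be sketched in Python as below, again with the averaged Enzyme A points. The names are hypothetical and this is an illustration of the idea rather than the project's MATLAB implementation.

```python
from itertools import combinations, product

# Averaged Enzyme A points (S in uM, V0 in uM/s).
S = [3.75, 7.5, 15, 30, 65, 125, 250, 500, 1000, 2000]
V = [0.0235, 0.0437, 0.0788, 0.1552, 0.2633, 0.3997, 0.5553, 0.6730, 0.7908, 0.8794]

def sse(k_m, v_max):
    """Sum of squared differences between predicted and measured velocities."""
    return sum((v_max * s / (k_m + s) - v) ** 2 for s, v in zip(S, V))

def solve_two_points(p1, p2):
    """Return (K_m, V_max) for the curve passing through both points."""
    (s1, v1), (s2, v2) = p1, p2
    k_m = s1 * s2 * (v2 - v1) / (v1 * s2 - v2 * s1)
    return k_m, v1 * (k_m + s1) / s1

matched = [solve_two_points(p, q) for p, q in combinations(zip(S, V), 2)]
k_ms = [k for k, _ in matched]
v_maxs = [v for _, v in matched]

# Pair every K_m candidate with every V_max candidate: 45 * 45 = 2025.
grid = list(product(k_ms, v_maxs))
best_k_m, best_v_max = min(grid, key=lambda c: sse(*c))
print(len(grid), best_k_m, best_v_max, sse(best_k_m, best_v_max))
```

Because the grid contains the 45 matched pairs as a subset, its best SSE can never be worse than the best SSE from Step 1.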

Step 3: Gradient Descent

The results from the previous two steps look pretty good. Could we improve on this though? By using gradient descent we can edge even closer to the optimal value of V_max and K_m.

We start from the previously found "optimal values" of V_max and K_m shown in the results above. Because these may be suboptimal values, in other words values that lead us to a local minimum of the Sum of Squared Errors (SSE), the plan is to explore ±10% around our current values. As shown in Testing Pairs, we try 5 values around each current value. Because we have both V_max and K_m, this results in 5 * 5 = 25 starting pairs. Each pair is run through gradient descent, and after convergence the SSE is evaluated. The converged pair with the least SSE is kept.
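A minimal Python sketch of this multi-start gradient descent is shown below. Several details are assumptions, not taken from the project: the 5 trial values are assumed to be evenly spaced factors from 0.90 to 1.10, the starting guess is a rounded placeholder rather than the project's actual result, the parameters are rescaled by their starting values, and a step-halving rule stands in for a fixed learning rate so that the SSE never increases.

```python
from itertools import product

# Averaged Enzyme A points (S in uM, V0 in uM/s).
S = [3.75, 7.5, 15, 30, 65, 125, 250, 500, 1000, 2000]
V = [0.0235, 0.0437, 0.0788, 0.1552, 0.2633, 0.3997, 0.5553, 0.6730, 0.7908, 0.8794]

def sse(v_max, k_m):
    return sum((v_max * s / (k_m + s) - v) ** 2 for s, v in zip(S, V))

def grad(v_max, k_m):
    """Partial derivatives of the SSE with respect to V_max and K_m."""
    g_v = g_k = 0.0
    for s, v in zip(S, V):
        r = v_max * s / (k_m + s) - v            # residual at this point
        g_v += 2 * r * s / (k_m + s)
        g_k += -2 * r * v_max * s / (k_m + s) ** 2
    return g_v, g_k

def descend(v_max, k_m, iters=500, tol=1e-14):
    """Gradient descent on rescaled parameters with step halving."""
    v_scale, k_scale = v_max, k_m                # assumption: scale by start values
    current = sse(v_max, k_m)
    for _ in range(iters):
        g_v, g_k = grad(v_max, k_m)
        step, accepted = 1.0, None
        while step > 1e-12 and accepted is None:
            new_v = v_max - step * g_v * v_scale ** 2
            new_k = k_m - step * g_k * k_scale ** 2
            if new_v > 0 and new_k > 0:
                new_sse = sse(new_v, new_k)
                if new_sse < current:            # accept only improving steps
                    accepted = (new_v, new_k, new_sse)
            step /= 2
        if accepted is None:
            break                                # no improving step was found
        improvement = current - accepted[2]
        v_max, k_m, current = accepted
        if improvement < tol:
            break
    return v_max, k_m, current

start_v, start_k = 0.96, 182.0                   # placeholder starting guess
factors = [0.90, 0.95, 1.00, 1.05, 1.10]         # assumed spacing within ±10%
starts = [(start_v * a, start_k * b) for a, b in product(factors, factors)]
results = [descend(v, k) for v, k in starts]
best_v_max, best_k_m, best_sse = min(results, key=lambda r: r[2])
print(len(starts), best_v_max, best_k_m, best_sse)
```

Rescaling matters here because V_max is of order 1 while K_m is of order 100, so an unscaled step that suits one parameter barely moves the other.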

For a deeper understanding of how gradient descent can enhance model optimization and to see its application in other contexts, consider exploring my other project blog post Neural Network Architecture from Scratch.
