Week 1: Into the Deep

Hello again, prospective reader! It’s great to see you again. We’ve had quite the week here, so allow me to catch you up to speed, starting with the basics.

Our team's logo: the words "Stem Visualizations"

In case you missed my introduction, I’m Madeline, one of four student members of the Stem Visualization team for Lehigh University’s Mountaintop Summer Experience. (Whew, what a long title! For simplicity’s sake, I’ll refer to us as the team: me, Madeline, along with Tyreese and Alex as full-time Fellows and Alicia as a part-time Associate; and to the program as Mountaintop.) We’re lucky enough to have joined this ongoing effort to create interactive web models for education, focused on letting students explore complex STEM topics. The project website can be found here, as screenshots don’t really do it justice. Past models have included infectious disease modeling, zero-energy cooling systems, and reaction kinetics; this year we’ve pivoted toward models centered on bioengineering and data science. I’ll also update this post to share my teammates’ blogs as I get them!

With that said, I ought to get to the purpose of this post: sharing progress with you all! It’s been a journey this week, that’s for sure. The first and most intimidating challenge of the project was simply learning the makeup of our predecessors’ work: installing prerequisite programs, gaining access to the GitHub repository, and studying the code to understand how the finished products came to be. This was made more challenging by the fact that the student team is entirely new this year. Hopefully, blogs like this will help anyone coming after us make the learning process a little less hectic.

A picture of a closed plastic plate bioreactor system. Tall vertical plastic plates house tubes of growing algae.
Closed Plastic-Plate Photobioreactor

Once we prepped our digital workspaces, the next hurdle was finding places to dive into this year’s module. Our grand concept is to present the user with an interface that models the reactions within a photobioreactor. These setups grow microalgae under tightly controlled conditions, with the goal of maximizing the synthesis of a desired product.

The specifics of bioengineering are not my strong suit, and I won’t pretend that they are; for further reading, linked here is the paper serving as the genesis of much of this project. What matters for us is knowing that this is a complicated process with many variables at play, each of which needs to be fine-tuned if product synthesis is to be maximized. That goal is best suited to the technology I’ve spent the week studying, so allow me to walk you through the basic idea!

Machine Learning has a lot of hype swirling around it right now; so much so that the words alone start to lose meaning. To be clear, we aren’t attempting to build AI in the sense of training a computer to speak, hear, or see. Our model, in comparison, is actually somewhat simple: given a certain set of initial parameters and a trained history of what the reaction should look like under those conditions, return a prediction of what the reaction will look like in the future. Thankfully, this fits the definition of a bread-and-butter data science task: regression. With the task defined, we can move toward deciding what type of model to train.
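
To make that framing concrete, here’s a tiny sketch in Python using scikit-learn. Every feature name and number below is invented for illustration; these aren’t our actual reactor parameters or data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: each row is one set of initial conditions,
# and each target is a measured outcome some time later.
X_train = np.array([
    [150.0, 25.0, 0.5],  # e.g. [light level, temperature, CO2 rate]
    [200.0, 27.0, 0.7],
    [250.0, 30.0, 0.9],
])
y_train = np.array([1.2, 1.8, 2.1])  # e.g. product concentration later on

model = LinearRegression().fit(X_train, y_train)

# Given an unseen set of conditions, predict the future outcome.
print(model.predict([[220.0, 28.0, 0.8]]))
```

A real model would be far more sophisticated, but the shape of the task stays the same: conditions in, predicted behavior out.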

Regression models come in many shapes, from simple linear predictors to Gaussian Process predictors to multi-layer neural networks. If none of those words mean a thing to you, you now understand the place we were in at the beginning of the week, and the gargantuan task ahead of us. Much of this week’s progress came in the form of tons of research just to understand the principles behind each candidate method. After all, there’s no way to be certain about a model’s efficacy if you have no clue what it’s doing! To that end, I spent about two days compiling basic information on different regression methods. You can see the fruits of my efforts below, and if any of them intrigues you, there are more links to the sources I’ve referenced at the end of this post.

Some whiteboard notes describing the idea of Logistic Regression.

At this point, reader, your head is likely swimming, as was mine. Let’s do a lightning round, then, and demystify each of these concepts. First: Logistic Regression, perhaps the simplest process, and the backbone of many others. We read in a set of inputs in an attempt to predict a single binary output (it’s either true or it isn’t!). Ironically, despite having “regression” in the name, this is actually a classification model. We can turn it into a probability predictor, though; that’s what the logistic (sigmoid) equation is for. Efficacy here is measured through a Receiver Operating Characteristic (ROC) curve. By measuring the area under that curve (AUC for short), we get a value between 0 and 1, with higher values representing a better-performing model!
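
If you’d like to see those pieces in action, here’s a minimal sketch using scikit-learn on synthetic data (nothing from our project):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data: 500 samples, 20 input features, a true/false label.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba is the "probability predictor" view described above;
# the AUC boils the whole ROC curve down to one number near 1.
probs = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
```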

These aren’t enough for our problem, though. We aren’t predicting something binary; rather, we want to see values over time. That’s where the latter two models, Gaussian Processes (GPs) and neural networks, come in! Both can be used for classification and regression, and thus are a bit more complicated. Once again, I’m hardly an expert; our job is simply to understand the tools in our belt so as to apply them properly. GPs are probability-based, non-parametric predictors: that is to say, instead of mapping simple inputs to an output, a GP treats the data as samples from a distribution over the possible functions that could have produced it, and its predictions come with the probabilities that view generates. Once again, a fascinating system, but not quite aligned with our needs. That brings us to the last, and certainly not least!
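
For the curious, here’s a quick sketch of what fitting a GP looks like in scikit-learn; the toy sine-wave data and kernel choice are purely illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A handful of observations of some unknown 1-D function.
X_train = np.array([[1.0], [3.0], [5.0], [6.0], [8.0]])
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X_train, y_train)

# A GP returns a mean prediction *and* an uncertainty estimate,
# reflecting the distribution over candidate functions it inferred.
X_test = np.linspace(0.0, 10.0, 5).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
print(mean)
print(std)
```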

Neural networks are probably the most complex of the systems I’ve looked into in terms of finding a fit for this project. Just the basics took me a whole whiteboard and a few hours of reading, and I’ve only scratched the surface. But at their most basic, here’s the idea: we create neurons, units of processing that weight each incoming value, add a bias, squash the result through an activation function, and pass it onward. We arrange neurons into column-like layers and stack those layers together, chaining the decision-making of each layer to the next to create a more complex decision process. The input and output layers are visible, but the layers in between are generally hidden; all of this simulates a more humanlike process of learning. Neural networks themselves come in many forms, using different algorithms to tweak the decision weights, measure loss, and improve over time. We’re even in luck: a specific type of neural network is perfectly suited to our problem of photobioreactor optimization!
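
To make the neuron idea tangible, here’s a bare-bones example in Python, in the spirit of the Victor Zhou tutorial linked below [1]. The weights and bias are arbitrary numbers I picked for illustration:

```python
import numpy as np

def sigmoid(x):
    # Squashes the weighted sum into the range (0, 1).
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weight each input, add the bias, then apply the activation.
        total = np.dot(self.weights, inputs) + self.bias
        return sigmoid(total)

# Two inputs with arbitrary weights and bias.
neuron = Neuron(weights=np.array([0.5, -1.0]), bias=2.0)
print(neuron.feedforward(np.array([3.0, 4.0])))  # some value in (0, 1)
```

A full network is just many of these chained together in layers, with a training algorithm nudging the weights and biases to shrink the loss.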

That brings me to the final project of the week: getting a working demo of one of these models fired up. The theory is great, and nothing gets built without at least a surface-level knowledge of the underlying system. Still, I thought it’d be prudent to study a functional example, and more importantly, to get it working on my own hardware. Much of my time was spent studying this demo code built on Keras and TensorFlow, Python libraries made specifically to aid in creating neural networks. It walks through the entire process of building a simple Recurrent Neural Network (RNN), one designed around a problem very similar to our own: predicting future temperatures from past weather conditions. Kudos to the Keras devs for such excellent documentation!
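
To give a taste without reproducing the whole demo, here’s a stripped-down sketch of a recurrent model in Keras. The shapes, layer sizes, and random placeholder data are mine, not the demo’s actual weather dataset:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data: 100 sequences, each 10 timesteps of 3 features,
# with a single future value to predict per sequence.
X = np.random.rand(100, 10, 3)
y = np.random.rand(100, 1)

model = keras.Sequential([
    keras.Input(shape=(10, 3)),
    layers.LSTM(32),   # a recurrent layer that reads the whole sequence
    layers.Dense(1),   # one regression output: the predicted next value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)

print(model.predict(X[:1]))  # forecast for the first sequence
```

The LSTM layer here is one common recurrent variety; the real demo dresses this up with proper data windowing, normalization, and validation.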

And that’s where we stand as of the end of week one. I’m happy to say we’re feeling confident in our progress so far! The project’s been broken into key components, and each of us has an area of expertise to put our time toward. Next week, our final teammate will be back from traveling, and we can get into the nitty-gritty of not only implementing a working model but also ensuring its educational value for the bioengineering concepts at hand. Thanks for reading this far, and have a great day!

-Maddie

Further Reading:
[1] Victor Zhou – Intro to Neural Networks

[2] Victor Zhou – Keras For Beginners

[3] Katherine Bailey – Gaussian Processes for Dummies

[4] Arthur Mello – Logistic Regression: The Basics