Since the dawn of time, man has wanted to know what is going to happen tomorrow, next week, next year, and even the next decade. Lucky for us here in the present day, computer algorithms and large swaths of data have started to give us the ability to predict the future more reliably.
In this blog post, we will build a flow rate prediction algorithm for the Norfork River in Arkansas. Both farmers and recreationists depend heavily on the river's flow rate. As the river flows faster, the risk of flooding increases, and flooding can destroy crops for farmers and ruin vacations for tourists.
My goal with this flow rate prediction algorithm was to give a better idea of what the next 4 days of river flow will look like. Below, I will discuss the data science process that I underwent to make these predictions.
Step 1: Fully Understanding the Problem
The first and most important concept of data science is that you are attempting to take an object, event, or thing from our world and model it on a computer. Because of this, data scientists need to have a very strong understanding of the domain that they are working in, which involves constant questioning, curiosity, and more research.
Fully understanding every part of a problem will get you the farthest when starting out. In this case, there were multiple times when I had to do more research to expand my understanding as I went along, which in turn meant more data collection.
Here’s an example. At the beginning of the process, I did not fully understand which dams flowed into the river or how those dams were used to generate power. This was important because the dams’ flow rates affect the river downstream. Once I figured out that there were two dams, it meant gathering data for both locations, which doubled the amount of dam data needed.
Understanding the problem, asking questions of stakeholders, and doing more research is always the first and most important step.
Step 2: Gathering and Cleaning
Next, it’s time to look at the data. Good data is hard to come by, and most of the time, it will require both pulling from multiple sources and cleaning. If you ever encounter clean data from the get-go, count your lucky stars. You either had a good data engineer or need to buy a lottery ticket right now!
For this project, I needed to gather data from four different sources: a historical weather service that provided data in Excel format, a weather API that returned JSON, a government website that required a web scraper to turn its contents into a data frame, and a map of lake water levels that also required a web scraper.
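As a rough illustration of the JSON source, the sketch below flattens a weather payload into a data frame with pandas. The payload shape and the field names (`station`, `hourly`, `time`, `precip_in`, `temp_f`) are made up for the example and are not taken from the API used in this project.

```python
import json
import pandas as pd

# Hypothetical payload; a real weather API will use different field names.
raw = """
{
  "station": "example-station",
  "hourly": [
    {"time": "2021-03-01T00:00:00", "precip_in": 0.0, "temp_f": 41.2},
    {"time": "2021-03-01T01:00:00", "precip_in": 0.1, "temp_f": 40.8},
    {"time": "2021-03-01T02:00:00", "precip_in": 0.3, "temp_f": 40.1}
  ]
}
"""

payload = json.loads(raw)

# One row per hourly record, with the station name carried along as a column.
weather = pd.json_normalize(payload, record_path="hourly", meta=["station"])

# Parse timestamps so the frame can later be joined to other sources by time.
weather["time"] = pd.to_datetime(weather["time"])
weather = weather.set_index("time").sort_index()

print(weather)
```

Parsing the timestamps up front matters because every other source has to line up on the same time index later.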
Gathering data from multiple sources requires careful thought about how the data will be joined together and stored. It is necessary to check and double-check that the data is representing what you think it is. It can be very easy to trust the script and just assume that the right thing is happening, but the second you start assuming is the second you will run into a variety of problems.
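One way to check and double-check a join is to let pandas do some of the checking. The sketch below uses two small made-up tables (`flow` and `rain`, both hypothetical) and asks `merge` to validate the key relationship and to report which rows failed to match.

```python
import pandas as pd

# Hypothetical daily tables; values are made up for illustration.
flow = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"]),
    "flow_cfs": [1200.0, 1350.0, 1500.0],
})
rain = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-04"]),
    "precip_in": [0.0, 0.4, 0.1],
})

# validate="one_to_one" raises if either side has duplicate dates;
# indicator=True adds a "_merge" column saying where each row came from.
joined = flow.merge(rain, on="date", how="outer",
                    validate="one_to_one", indicator=True)

# Rows that did not match on both sides deserve a manual look.
unmatched = joined[joined["_merge"] != "both"]
print(unmatched[["date", "_merge"]])
```

An outer join with an indicator column makes gaps visible instead of silently dropping them, which is exactly the kind of assumption that is easy to miss in a script.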
Cleaning the Data
After gathering all of the data, cleaning it is very important. A model trained on dirty data might work great on the training set, but things become disastrous when it makes predictions on new data. As the saying goes, “Garbage in, garbage out.” In other words, if you feed a model garbage data, you will get garbage results when you go to production.
I wanted a supervised model that does not need to be retrained the way traditional ARMA, ARIMA, or SARIMA time series models do. That took a great deal of work: shifting columns to create lagged features and computing moving averages, guided by the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the target. These functions show how strongly the target correlates with its own past values, which reveals autocorrelation and seasonality and suggests which lags and moving-average windows are worth including.
Once we have gathered and cleaned the data, we are ready to begin model building.
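The column shifting and moving averages can be sketched with pandas. The data, the column names, and the choice of lags and window below are all illustrative assumptions; the 4-day-ahead target simply mirrors the 4-day forecast horizon mentioned earlier and may not be how the project defined it. `Series.autocorr` stands in here as a quick substitute for a full ACF/PACF plot.

```python
import pandas as pd

# Hypothetical daily flow readings (made-up values).
dates = pd.date_range("2021-03-01", periods=10, freq="D")
df = pd.DataFrame(
    {"flow_cfs": [100.0, 120.0, 140.0, 130.0, 110.0,
                  100.0, 120.0, 140.0, 130.0, 110.0]},
    index=dates,
)

# How strongly does today's flow correlate with the flow 1 and 2 days ago?
for lag in (1, 2):
    print(lag, df["flow_cfs"].autocorr(lag=lag))

# Lagged copies of the series: yesterday's flow and the day before's.
df["flow_lag_1"] = df["flow_cfs"].shift(1)
df["flow_lag_2"] = df["flow_cfs"].shift(2)

# 3-day moving average of past values only; shifting first keeps today's
# flow out of its own feature.
df["flow_ma_3"] = df["flow_cfs"].shift(1).rolling(window=3).mean()

# Assumed target: the flow 4 days ahead.
df["target_4d"] = df["flow_cfs"].shift(-4)

# Drop rows where a lag or the future target is not available.
features = df.dropna()
print(features)
```

Because every feature is built from past values, each row is an ordinary supervised example, so any regression model can be trained on the table.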
Step 3: Building the Model
Building the model is what probably comes to mind when you hear the words “data science.” That’s what this step is all about!
Using the H2O AutoML package, I was able to get past many of the difficulties of model building. The package lets us take our cleaned data and simply run the AutoML function, which trains a range of models and helps us pick the best one based on the criteria we set. For example, if we were doing a classification problem, we might look at the AUC.
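For readers who have not used it, a run with H2O's Python client looks roughly like the sketch below. The function name `train_flow_model`, the target column `target_4d`, the model count, and the output directory are hypothetical choices for illustration, not the settings used in this project. Running it requires the `h2o` package and a Java runtime.

```python
import pandas as pd


def train_flow_model(features: pd.DataFrame, target: str = "target_4d",
                     max_models: int = 10, model_dir: str = "models"):
    """Run H2O AutoML on a cleaned feature table and save the best model.

    The column name, model count, and directory are illustrative defaults.
    """
    # Imported here so this module still loads where H2O is not installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # starts or connects to a local H2O cluster (needs Java)

    frame = h2o.H2OFrame(features)
    predictors = [c for c in frame.columns if c != target]

    # sort_metric="MSE" ranks the leaderboard by mean squared error.
    aml = H2OAutoML(max_models=max_models, seed=1, sort_metric="MSE")
    aml.train(x=predictors, y=target, training_frame=frame)

    print(aml.leaderboard.head())

    # Persist the top-ranked model so a pipeline or API can load it later.
    saved_path = h2o.save_model(model=aml.leader, path=model_dir, force=True)
    return aml.leader, saved_path
```

Calling `train_flow_model(features)` on a cleaned data frame would return the leading model and the path it was saved to.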
In this case, since it was a regression problem where we are trying to predict a flow rate (a continuous variable), I went with the mean squared error (MSE). You can think of the MSE as the average of the squared distances between each actual point and its prediction, so large misses are penalized much more heavily than small ones. The model is attempting to minimize this value.
Another great feature of the AutoML function is that we get a variety of models automatically tuned for us. Once training has completed, we can save the model and use it in our pipelines or behind an API for prediction.
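To make the metric concrete, here is a small worked example in plain Python. The flow values are made up, and `mean_squared_error` is a hand-written helper for illustration rather than a library call.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Made-up flow rates, for illustration only.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 170.0]

# Errors are -10, 10, 30; squares are 100, 100, 900; the mean is 1100 / 3.
mse = mean_squared_error(actual, predicted)
print(mse)
```

Note how the single miss of 30 contributes 900 of the 1100 total: squaring makes the metric especially sensitive to large errors, which is useful when the worst misses (such as an unpredicted flood) matter most.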
Once the modeling is complete, it is important to stay ever vigilant with the model. It will be used on data that it has not seen before. With that in mind, it is important to monitor the model’s performance and, when necessary, figure out why some predictions are not working as expected.
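A lightweight way to monitor a model is to log each prediction, join it with the observed value once that is available, and track the error over time. The sketch below assumes such a log already exists; the column names, values, and the 200 cfs threshold are all made up.

```python
import pandas as pd

# Hypothetical log of production predictions joined with observed flow
# once it becomes available (values are made up).
log = pd.DataFrame({
    "date": pd.date_range("2021-04-01", periods=6, freq="D"),
    "predicted_cfs": [1000.0, 1100.0, 1200.0, 1300.0, 1400.0, 1500.0],
    "actual_cfs": [1020.0, 1080.0, 1250.0, 1900.0, 1450.0, 1480.0],
}).set_index("date")

log["abs_error"] = (log["predicted_cfs"] - log["actual_cfs"]).abs()

# Rolling 3-day mean absolute error shows whether accuracy is drifting.
log["mae_3d"] = log["abs_error"].rolling(window=3).mean()

# Flag days whose error exceeds an arbitrary threshold for investigation.
THRESHOLD_CFS = 200.0  # assumption; pick a value that fits the river
flagged = log[log["abs_error"] > THRESHOLD_CFS]
print(flagged[["predicted_cfs", "actual_cfs", "abs_error"]])
```

A flagged day is a starting point for the "figure out why" step: it may point to bad input data, an unusual dam release, or a factor the model has never seen.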
Data science is all about continuous improvement; you should be testing, monitoring, doing more research, and improving what you are doing at all times. This includes gathering new data or looking for factors you had not considered that could be affecting the results.
Step 4: Deploying the Model
The final part of data science is deploying the model. There are a variety of ways to deploy models, but one of the best ways is to use the data pipelines you have and hook them up to a cloud platform. Cloud platforms are relatively inexpensive and easy to get started with. Setting up functions, databases, and the model itself in the cloud gives some peace of mind that the data is safe and the pipeline is working as intended.
The raw data that we started with gave us a simple linear regression accuracy of 0.7%. That means the baseline model could not even reach 1% accuracy. However, after careful gathering, cleaning, and modeling (as described above), we had increased that accuracy to a whopping 84.7%.
This is a gigantic improvement. Currently, our production model is still holding between 80% and 85% accuracy. The results were fantastic, and we were very satisfied.
If you have any questions about this project or data modeling in general, please feel free to reach out to me by leaving a comment below. Thank you for reading! If you enjoyed this post, check out the many others on the Keyhole Dev Blog.