Journey to the Cloud

My data are getting bigger every year, and it’s becoming harder to work on some projects from my laptop (my desktop can still handle them for now). Perhaps more importantly, I’m introducing more real-world data work as part of my classes, and many students just don’t have much storage on their personal laptops. They basically have $2,500 internet machines with nice displays. Anyway, if my students are going to actually use real data, they either need to get new computers, do everything in the library, or get access to cloud server storage as part of my class. I chose the last option.

Transitioning to the cloud wasn’t too bad. Here are some steps:

  1. Get comfortable with AWS instances and images. It’s easy enough to set this up once you have an AWS account, but it can feel a bit like learning a new language at first. Since I can do all of my work through R, I relied heavily on a couple of tutorials: one from Sebastian Schweer here and another from Jagger Villalobos here. In the end, I basically used a pre-built AMI with R and RStudio, generously managed and provided by Louis Aslett, with some adjustments to allow easy access online (Elastic IP, etc.). The relevant AWS CLI commands are sketched after this list.

  2. Connect to the AWS instance with SSH. I know Ubuntu has SSH built in, but I’m not really a programmer, so I ended up using Termius for Linux. I also relied on some great resources from my colleague, David Jacho-Chavez, to get everything set up. He has an awesome tutorial here. It focuses on JupyterHub, but a lot of the same steps apply to the R-centric AMI as well. The basic SSH command is sketched after this list.

  3. Relational Databases! Termius has SFTP functionality for a price, but since I’m revamping some things anyway, why not go all in and commit to SQL?! I’m still learning some of this, but for now, I’m using PostgreSQL as a way to maintain access to all of my research data through the cloud. Belicia Rodriguez has a great tutorial for this here. Note that her tutorial focuses on setting up the PostgreSQL server on AWS, but you might also want to set one up on your own local computer if you plan to move data around. If you’re using a separately attached EBS volume (as I am), then you’ll also need to change some PostgreSQL settings so that the data are saved to the right drive (and you’ll need to give the postgres user some extra privileges). Here are some good instructions, and the key commands are sketched after this list.

  4. Expanded Data. I’m sure there are better ways to do this, but I was worried about storing my data on the root volume of a given instance: if I ever terminated the instance, I’d also lose all of the data. So, I created a separate Elastic Block Store (EBS) volume and attached it to my instance, where it essentially acts as its own drive. Once it’s attached in AWS, you still need to make the volume available for use. Amazon has some instructions for this here, and the basic commands are sketched after this list.

  5. SFTP. Although I’m using PostgreSQL for most data management, there are some cases where I just don’t want to turn all of my data files into nice SQL databases. For example, publicly available data from Medicare Advantage are a huge headache and consist of hundreds of different Excel or CSV files, many with different names and different formatting. I find it much easier to work with these raw files directly in R and then do some clean-up after the fact. For that workflow, I just want the raw data available without dealing with SQL. So, I use FileZilla to move the data from my own computer to the AWS volume. This is pretty straightforward…just add the AWS key and copy the AWS DNS as the FileZilla host. Make sure you’ve set it to SFTP and that you’ve added the proper username. After setup, you should be able to see the drives and move data freely from one place to another. Finally, you’ll need to make sure you have write privileges on the new EBS volume. For this, try sudo chown -R ubuntu /data (assuming the user name is ubuntu and the directory you’d like to grant access to is “data”). The command-line version of this transfer is also sketched after this list.
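
To make step 1 a bit more concrete, here is a rough sketch of launching an instance from the AWS CLI rather than the console. This is only a sketch under assumptions: the AMI ID, instance type, key pair, security group, instance ID, and Elastic IP below are placeholders (not the actual values for Louis Aslett’s AMI), and your security group needs to allow inbound SSH and RStudio Server traffic.

```bash
# Launch an EC2 instance from a pre-built AMI (placeholder IDs; look up the
# current AMI ID for your region before running this)
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t3.large \
  --key-name my-keypair \
  --security-groups rstudio-sg

# Allocate an Elastic IP and attach it to the instance so the address
# stays the same across restarts
aws ec2 allocate-address
aws ec2 associate-address --instance-id i-0123456789abcdef0 --public-ip 203.0.113.10
```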
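
For step 2, Termius wraps all of this in a GUI, but the underlying SSH connection is a single command. The key file name and public DNS below are placeholders; the default user on Ubuntu-based AMIs is ubuntu.

```bash
# Lock down permissions on the key file (SSH refuses world-readable keys)
chmod 400 my-keypair.pem

# Connect to the instance using its public DNS name or Elastic IP
ssh -i my-keypair.pem ubuntu@ec2-203-0-113-10.compute-1.amazonaws.com
```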
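
For step 3, the extra pieces beyond the tutorial are moving PostgreSQL’s data directory onto the attached volume and giving the postgres user ownership of it. This sketch assumes Ubuntu with PostgreSQL 12 and an EBS volume already mounted at /data; the version number, paths, database name, and user name are all placeholders.

```bash
# Give the postgres user ownership of a directory on the attached volume
sudo mkdir -p /data/postgresql
sudo chown -R postgres:postgres /data/postgresql

# Stop the server, copy the existing data directory over, and point
# data_directory in postgresql.conf at the new location
sudo systemctl stop postgresql
sudo rsync -av /var/lib/postgresql/ /data/postgresql/
sudo sed -i "s|^data_directory.*|data_directory = '/data/postgresql/12/main'|" \
  /etc/postgresql/12/main/postgresql.conf
sudo systemctl start postgresql

# Create a database and a user for the research data
sudo -u postgres createdb research
sudo -u postgres createuser --pwprompt myuser
```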
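
For step 4, “making the volume available” boils down to formatting the new volume once, mounting it, and adding an fstab entry so it comes back after a reboot. This follows the same general procedure as Amazon’s instructions, but the device name (/dev/xvdf), filesystem, and mount point (/data) are assumptions that will vary by instance.

```bash
# Identify the new volume (it shows up as an extra, unmounted device)
lsblk

# Format it (only on a brand-new, empty volume; this erases any existing data)
sudo mkfs -t xfs /dev/xvdf

# Create a mount point and mount the volume
sudo mkdir /data
sudo mount /dev/xvdf /data

# Add an fstab entry so the volume remounts automatically after a reboot
echo '/dev/xvdf /data xfs defaults,nofail 0 2' | sudo tee -a /etc/fstab
```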
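
Finally, for step 5, FileZilla handles the transfer graphically, but the same thing works with plain sftp using the same key as the SSH connection. The ownership fix is the one-liner from the list above; the key file, hostname, and folder names here are placeholders.

```bash
# On the server (over SSH): make sure the ubuntu user can write to the volume
sudo chown -R ubuntu /data

# On the local machine: open an SFTP session with the same key used for SSH
sftp -i my-keypair.pem ubuntu@ec2-203-0-113-10.compute-1.amazonaws.com

# Inside the sftp session: recursively upload a folder of raw files
# sftp> put -r ./medicare_advantage_raw /data/medicare_advantage_raw
```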

In the end, I (and my students) can work with R and RStudio simply through a web browser and have access to all of the necessary data (even very large datasets) entirely through the cloud.