A Continued learning
A.1 What next?
It is one thing to learn the principles of how to do reproducible research. It is quite another thing to do so in daily practice. So, how can we practice these skills and tools we learned during the course?
- If you have your own data already, then it’s easy: Just start using these tools bit by bit. Slowly and steadily use the tools from this course and continue learning. It isn’t a race; use what you can without getting totally overwhelmed.
- If you don’t have access to your data yet, check out Section B.5 in the Resources Appendix B.
- If your collaborators or supervisor don’t use the tools from this course,
are not supportive, or are supportive but unable to learn and use the tools
themselves (e.g. because they are too busy), there are several steps you can
take. This situation is definitely challenging and is likely the one most
commonly encountered. Use the tools as best you can, in small bits at a time,
so that you continue learning without getting completely overwhelmed by all
the new things and ways of doing them. Below are some small steps you can
choose from to start incorporating R and reproducibility into your work:
- As much as possible, set up your projects, folders, and files in a more reproducible way (e.g. by using the structure created by the prodigenr package).
- Create all your figures entirely in R and using R scripts or R Markdown files.
- Write everything research related in R Markdown and convert it to a Word document when you need to send it to co-authors. If they make edits or comments, incorporate them into the original R Markdown file rather than keeping them only in the Word document.
- Start slowly making use of Git, even if you can’t share on GitHub or are not yet comfortable doing so. Git and GitHub are two separate things, and Git can still be used on your computer without putting anything online.
- Use R entirely to wrangle and clean your data rather than, e.g. opening up Excel and editing the data there.
- If you’re restricted to working with your data in a virtual remote environment (e.g. at Statistics Denmark), you may not have authorization to install some programs. However, most remote environments provide the software commonly used for data analysis tasks. Check out Section A.2 below for details about doing reproducible research with R on the Statistics Denmark servers.
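To make the point about Git working without GitHub concrete, here is a minimal sketch of a purely local repository, run from a terminal such as the RStudio Terminal or Git Bash. The throwaway folder, the file name analysis.R, and the placeholder name and email are assumptions for illustration only; in practice you would run the same commands inside your own project folder.

```shell
# Work in a throwaway folder so nothing in a real project is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Turn the folder into a local Git repository. No GitHub account,
# remote, or internet connection is involved at any point.
git init --quiet

# Create a (hypothetical) script and record it in the history.
echo 'print("hello")' > analysis.R
git add analysis.R
git -c user.name="Your Name" -c user.email="you@example.com" \
  commit --quiet -m "Add first analysis script"

# The history lives entirely on this computer.
git log --oneline
commit_count="$(git rev-list --count HEAD)"
remote_count="$(git remote | wc -l | tr -d ' ')"
echo "commits: $commit_count, remotes: $remote_count"
```

The repository has a full history but no remotes, which is all that is needed to track your own work.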
A.2 Reproducible research with R in the real world
This section showcases how the reproducible workflow we’ve used in the course is possible in a restricted virtual environment. The case example is the virtual desktop environment used when working on Statistics Denmark (DST)’s research servers. One thing to note, which may be a bit confusing, is that DST also uses the term “project”, but with a very different meaning from an RStudio R Project:
- An R project is a collection of R scripts and documentation in a single, “parent” folder.
- A DST project is a research project. Each DST project can access separate folders and data, and any user with access to a project can log on and work within those folders and data.
While in the virtual environment, you very likely will not have administrator
rights, so you will not be able to install programs yourself. Often, you will be
logged into a virtual computer with no internet access. From there, you will not
be able to access remote locations like GitHub or the online archive from which
R packages are normally installed.
A.2.1 Installing software
The software requirements are the same as for this course, except that the system administrators must do the setup. The good news is that once they’ve done it, everything is pretty much set up for us! Contact your system administrator and ask them if R and Git are supported. If not, then it’s not a lot to ask the administrators to implement it. That’s a basic service. In the case of Statistics Denmark, they’ve pre-installed Git, R and RStudio, as shown in Figure A.1.
A.2.2 Packages are pre-installed
To allow access to packages in the restricted environment, DST has pre-installed
all packages on CRAN, so we will be able to directly access any package from
CRAN without installing it ourselves.
While we won’t be able to use packages from sources other than CRAN (like the r3 package developed for this course), we do have access to the >15,000 packages on CRAN (this also means that opening and browsing the packages tab can slow your session down, shown in Figure A.2).
A.2.3 Setting up our project
The prodigenr package is available, so we’ll set the project up like we normally would (refer to Figure 7.1). We may get a message nudging us to set up our Git configuration and not just use the defaults. The defaults should be based on your log-on ID (and be somewhat sensible) here, but we can set them ourselves later.
A.2.4 Activating Git for the project
Normally, we would activate Git with the
usethis::use_git() command. This may
not always work in the DST environment. Instead, we’ll activate Git through the
RStudio menu tab:
Tools -> Version Control -> Project Setup -> Version control system: Git, shown in Figure A.3.
After restarting RStudio, you should see the Git icons and tabs. Now we can set our Git config (since we don’t have access to the r3 package, we’ll set it with the git2r package that allows R to interact with Git):
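Git's user name and email can also be set with Git itself, from the RStudio Terminal or Git Bash. This is a minimal sketch, not the course's own snippet: the name and email are placeholders, and it configures a throwaway repository with --local so that it is safe to try. Replacing --local with --global would apply the settings to every repository for your user.

```shell
# A throwaway repository stands in for the R project folder.
repo_dir="$(mktemp -d)"
git init --quiet "$repo_dir"
cd "$repo_dir"

# Placeholder identity: replace with your own name and email.
git config --local user.name "Your Name"
git config --local user.email "you@example.com"

# Confirm what Git will record with each commit in this repository.
configured_name="$(git config --local --get user.name)"
configured_email="$(git config --local --get user.email)"
echo "$configured_name <$configured_email>"
```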
And now we can make our initial commit to the local Git repository, shown in Figure A.4.
We won’t be able to access an online remote Git repository like GitHub, since we have no internet access from the virtual desktop, but don’t worry! We can still use Git with a local repository on the virtual desktop to track our own work. In a few steps, we’ll set up a remote repository and get all the functionality of Git in a collaborative workflow as well!
A.2.5 Setting up a Git remote repository in the virtual environment
If we’re working in a team, we will want to have a remote repository for keeping the latest versions available to everybody on the team. In the version control session, we created a remote repository through GitHub’s website, and then created a local repository from it by cloning it. This time, we’ll do it all from the terminal, and since we’ve already set up a local repository, we’ll do it the other way around:
- Create a blank remote repository in a local folder that all team members have access to.
- Connect our local repository to it.
- Push the commits from our local repository to it.
The only difference here is that the location of the remote is not a GitHub URL and we don’t create the remote on our web browser with the GitHub interface. In fact, once we’ve created the remote repository, we link our local repository to it with the same two commands we used in the Git remote session.
This time, we create the remote repository with the Windows interface by using Git Bash in 3 steps:
First we create the folder where we want our remote to be located. We’ll put it at the root of our DST project directory and call it
Next, we go to the new directory and open Git Bash:
Git Bash is a program for operating Git through a command-line interface in Windows. It is similar to the terminal we used in RStudio. We should see a line indicating the current directory and below it a $ sign followed by a blinking bar indicating that Git Bash is ready to accept new commands.
Now, we use a new command:
git init --bare, which converts the current directory to a blank Git repository.
These three steps accomplish the same thing as creating a remote repository through the GitHub interface, which we covered in the course session. The rest of the setup (linking the local repository to the remote repository and pushing the contents of the local repository to the remote) is identical to the course session, except the location of the remote is a folder, not a GitHub URL, as shown in Figure 8.9.
So, let’s go back to our RStudio project and open the RStudio Terminal and complete the process:
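The two commands involved are git remote add and git push. Below is a minimal sketch of this step, not the exact commands typed on the DST server. To keep it runnable anywhere, temporary folders stand in for the shared DST folder and the R project folder, and the file name and identity are placeholders. It pushes HEAD (the current branch) so that it does not depend on the branch being called master.

```shell
# Temporary folders stand in for the shared DST folder and the R project.
work_dir="$(mktemp -d)"
remote_dir="$work_dir/git-remote"
project_dir="$work_dir/LearningR"

# Stand-in for the earlier steps: a bare remote and a local repository
# that already has one commit.
git init --quiet --bare "$remote_dir"
git init --quiet "$project_dir"
cd "$project_dir"
echo "# LearningR" > README.md
git add README.md
git -c user.name="Your Name" -c user.email="you@example.com" \
  commit --quiet -m "Initial commit"

# Link the project to the remote repository, which is a folder path
# rather than a GitHub URL.
git remote add origin "$remote_dir"

# Push the current branch and remember origin as its upstream.
git push --quiet -u origin HEAD

# Check the link and compare the latest commit on both sides.
git remote -v
local_head="$(git rev-parse HEAD)"
remote_head="$(git ls-remote origin | head -n 1 | cut -f1)"
echo "local:  $local_head"
echo "remote: $remote_head"
```

If the two commit IDs printed at the end match, the remote has received the local history.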
And that’s it! Now we have version control shared between all users of the DST project integrated into RStudio.
Alternatively, if you’re more comfortable working in the terminal, the remote repository can also be set up by just using the terminal.
```shell
# Creates the folder for the remote.
mkdir -p E:/project_directory/git-remote
# Changes the Git working directory to the remote folder.
cd E:/project_directory/git-remote/
# Creates an empty remote repository in the remote folder.
git init --bare
# Changes the Git working directory back to the R project folder.
cd E:/project_directory/LearningR
# Links the project to the remote repository.
git remote add origin E:/project_directory/git-remote
# Pushes the local repository to the remote.
git push -u origin master
```
Are we missing out by not having access to GitHub? Not with regard to version control. Since we’re already working on a remote virtual desktop where our project collaborators also have access, the repositories here serve the same function as a remote repository like GitHub normally would:
- A platform for collaboration, where all team members can keep their edits synchronized and work in parallel without losing track of changes.
- A secure back-up of the entire history of the project in case something happens or goes wrong.
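To illustrate the collaboration point, the sketch below shows how a second team member might get a copy of the project from the shared folder and how changes travel between the two copies. Everything here is an assumption made for illustration: the temporary folders stand in for the shared folder and each person's working folder, and the names Ana and Ben, the emails, and the file name are hypothetical.

```shell
# Temporary folders stand in for the shared remote and two working copies.
work_dir="$(mktemp -d)"
remote_dir="$work_dir/git-remote"
git init --quiet --bare "$remote_dir"

# First team member: create the project, commit, and push to the remote.
git init --quiet "$work_dir/ana"
cd "$work_dir/ana"
echo "first draft" > notes.txt
git add notes.txt
git -c user.name="Ana" -c user.email="ana@example.com" \
  commit --quiet -m "Add notes"
git remote add origin "$remote_dir"
git push --quiet -u origin HEAD

# Second team member: clone the project from the shared folder,
# make a change, and push it back.
git clone --quiet "$remote_dir" "$work_dir/ben"
cd "$work_dir/ben"
echo "second draft" >> notes.txt
git -c user.name="Ben" -c user.email="ben@example.com" \
  commit --quiet -am "Revise notes"
git push --quiet

# First team member: pull to receive the change.
cd "$work_dir/ana"
git pull --quiet
line_count="$(wc -l < notes.txt | tr -d ' ')"
git log --oneline
```

After the pull, both working copies and the shared folder hold the same two commits, so the shared folder also acts as a backup of the full history.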
A.2.6 How R can help us deal with common issues in restricted environments
A.2.6.1 Keeping track of file paths
The preset folder structures are often ugly, long, and maze-like. While they
can’t be edited, a workflow that takes advantage of R projects and
here::here(), as shown in this course, means you don’t have to deal with the
long path names once you’ve moved everything into the project folder.
A.2.6.2 Using foreign data sources
Sometimes, these types of servers use specific software and a specific data file format. Perhaps the data is in a different format, or it is stored in a database, e.g. on an SQL server. R is extremely flexible, and you can open practically any data format in R using R scripts with familiar R syntax. This includes proprietary data formats such as SAS (which is the default raw data format used by DST) and ODBC-compatible SQL databases (often used for centralized storage of large amounts of data). The intermediate course material covers how to handle foreign formats and databases.
A.2.6.3 Some drawbacks to working in a virtual environment
To name a few:
- In addition to being a repository, GitHub can also be used as a forum for communicating about the project workflow (and for getting feedback and contributions from the public if using a public repository). Not having this project management aspect can be a limitation.
- You cannot copy-paste or conveniently import anything into the virtual environment, including formatting tables, code chunks, or scripts. Certain scripts (e.g. AutoHotkey) allow you to automatically “type out” the contents of your local clipboard onto the virtual desktop, but in practice these do not work on most virtual machine platforms due to keyboard driver issues, often creating dangerously erratic keyboard behavior instead. Using these types of scripts is not recommended.
- The installed versions of R, RStudio, and the packages will only be as up-to-date as the system administrators keep them. Again, if something is out of date and you need it updated, don’t be afraid to ask them to help you out. However, it can be tricky to keep the updates of all the different packages compatible with each other and with new versions of R, and keeping everything on the cutting edge of new releases is not feasible. In the case of DST, you will be running versions that are often several months behind the latest “real-world” releases, even in the most updated cases.
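When asking the administrators for an update, it helps to say exactly which versions are currently installed. Below is a minimal sketch for the terminal, assuming git is available on the PATH. The R line only runs if the Rscript command can be found.

```shell
# Report the installed Git version.
git_version="$(git --version)"
echo "$git_version"

# Report the installed R version, but only if Rscript can be found.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'cat(R.version.string, "\n")'
fi
```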