Want to help out or contribute?

If you find any typos, errors, or places where the text could be improved, please let us know by providing feedback either in the feedback survey (given during class), by using GitLab, or directly in this document with hypothes.is annotations.

  • Open an issue or submit a merge request on GitLab with the feedback or suggestions.
  • Add an annotation using hypothes.is. To add an annotation, select some text and then click the annotate button on the pop-up menu. To see the annotations of others, click the icon in the upper right-hand corner of the page.

A Continued learning

A.1 What next?

It is one thing to learn the principles of how to do reproducible research. It is quite another thing to do so in daily practice. So, how can we practice these skills and tools we learned during the course?

  • If you have your own data already, then it’s easy: Just start using these tools bit by bit. Slowly and steadily use the tools from this course and continue learning. It isn’t a race, so use what you can without getting totally overwhelmed.
  • If you don’t have access to your data yet, check out Section B.5 in the Resources Appendix B.
  • If your collaborators or supervisor don’t use the tools from this course, are not supportive, or are supportive but unable to learn and use the tools themselves (e.g. because they are too busy), there are several steps you can take. This situation is definitely challenging and is likely the one you will encounter most often. Use the tools as best you can, a little at a time, so that you continue learning without getting completely overwhelmed by all the new tools and ways of doing things. Below are some small steps you can choose from to start incorporating R and reproducibility into your work:
    • As much as possible, set up your projects, folders, and files in a more reproducible way (e.g. by using the structure created by the prodigenr package).
    • Create all your figures entirely in R, using R scripts or R Markdown files.
    • Write everything research related in R Markdown and convert it to a Word document when you need to send it to co-authors. If they make edits or comments, incorporate them into the original R Markdown file rather than keeping them only in the Word document.
    • Start slowly making use of Git, even if you can’t share on GitHub or are not yet comfortable doing so. Git and GitHub are two separate things, and Git can still be used on your own computer without putting anything online.
    • Use R entirely to wrangle and clean your data rather than, e.g. opening up Excel and editing the data there.
  • If you’re restricted to working with your data in a virtual remote environment (e.g. at Statistics Denmark), you may not have authorization to install some programs. However, most remote environments provide the software commonly used for data analysis. Check out Section A.2 below for details about doing reproducible research with R on the Statistics Denmark servers.

A.2 Reproducible research with R in the real world

This section showcases how the reproducible workflow we’ve used in the course is possible in a restricted virtual environment. The case example is the virtual desktop environment used when working on Statistics Denmark (DST)’s research servers. One thing to note, which may be a bit confusing, is that DST also uses the term “project”, but with a very different meaning from an RStudio R Project:

  • An R project is a collection of R scripts and documentation in a single, “parent” folder.
  • A DST project is a research project. Each DST project can access separate folders and data, and any user with access to a project can log on and work within those folders and data.

While in the virtual environment, you very likely will not have administrator rights, so you will not be able to install programs yourself. Often, you will be logged into a virtual computer with no internet access. From there, you will not be able to reach remote locations like GitHub or the online archive that install.packages() normally downloads R packages from.

A.2.1 Installing software

The software requirements are the same as for this course, except that the system administrators must do the setup. The good news is that once they’ve done it, everything is pretty much set up for us! Contact your system administrator and ask whether R and Git are supported. If not, it’s not a lot to ask the administrators to install them; that’s a basic service. In the case of Statistics Denmark, they’ve pre-installed Git, R, and RStudio, as shown in Figure A.1.

Figure A.1: Git, R, and RStudio are already installed.

A.2.2 Packages are pre-installed

To allow access to packages in the restricted environment, DST has pre-installed all packages on CRAN, so we will be able to directly access any package from CRAN with the library() function.

While we won’t be able to use packages from sources other than CRAN (like the r3 package developed for this course), we do have access to the >15,000 packages on CRAN (this also means that opening and browsing the packages tab can slow your session down, shown in Figure A.2).

Figure A.2: CRAN packages are already installed.

A.2.3 Setting up our project

The prodigenr package is available, so we’ll set the project up like we normally would (refer to Figure 7.1). We may get a message nudging us to set up our Git configuration and not just use the defaults. The defaults should be based on your log-on ID (and be somewhat sensible), but we can set them ourselves later.

A.2.4 Activating Git for the project

Normally, we would activate Git with the usethis::use_git() command. This may not always work in the DST environment. Instead, we’ll activate Git through the RStudio menu tab: Tools -> Version Control -> Project Setup -> Version control system: Git, shown in Figure A.3.

Figure A.3: Setting up Git for the project.

After restarting RStudio, you should see the Git icons and tabs. Now we can set our Git config (since we don’t have access to the r3 package, we’ll set it with the git2r package that allows R to interact with Git):

git2r::config(user.name = "My Name Here", 
              user.email = "mynamehere@mymailhere.com",
              global = TRUE)
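
If you prefer the terminal, the same settings can be written with `git config`. This is only a minimal sketch of the terminal equivalent, not part of the course workflow: the name and email are the same placeholders as above, and the `HOME` override is an assumption added so that, on a typical setup, the sketch does not touch your real settings.

```shell
# Sketch: terminal equivalent of the git2r::config() call above.
# HOME is pointed at a throwaway folder (an assumption for illustration);
# omit these two lines to change your own global settings instead.
HOME="$(mktemp -d)"
export HOME

# Store the placeholder name and email in the global Git configuration.
git config --global user.name "My Name Here"
git config --global user.email "mynamehere@mymailhere.com"

# Read the values back to confirm they were stored.
configured_name="$(git config --global --get user.name)"
configured_email="$(git config --global --get user.email)"
echo "$configured_name <$configured_email>"
```

Reading the values back with `--get` is a quick way to check what Git will record as the author of your commits.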

And now we can make our initial commit to the local Git repository, shown in Figure A.4.

Figure A.4: Initial commit to local repository.
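
The initial commit can also be made from the RStudio Terminal rather than the Git pane. The sketch below is an illustration under assumptions: it runs in a throwaway folder with a placeholder `README.md` so that it is self-contained, and it sets the placeholder name and email for that folder only. In a real project where Git is already activated, only the `git add` and `git commit` lines would be needed.

```shell
# Sketch: make the initial commit from a terminal.
# A throwaway folder stands in for the real R project (assumption).
project_dir="$(mktemp -d)"
cd "$project_dir"

# Activate Git for this folder and set placeholder author details locally.
git init --quiet
git config user.name "My Name Here"
git config user.email "mynamehere@mymailhere.com"

# Create a placeholder file, stage it, and commit it.
echo "# LearningR" > README.md
git add README.md
git commit --quiet -m "Initial commit"

# Count the commits on the current branch to confirm the commit exists.
commit_count="$(git rev-list --count HEAD)"
echo "Commits so far: $commit_count"
```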

We won’t be able to access an online remote Git repository like GitHub, since we have no internet access from the virtual desktop, but don’t worry! We can still use Git with a local repository on the virtual desktop to track our own work. In a few steps, we’ll set up a remote repository and get all the functionality of Git in a collaborative workflow as well!

A.2.5 Setting up a Git remote repository in the virtual environment

If we’re working in a team, we will want to have a remote repository for keeping the latest versions available to everybody on the team. In the version control session, we created a remote repository through GitHub’s website, and then created a local repository from it by cloning it. This time, we’ll do it all from the terminal, and since we’ve already set up a local repository, we’ll do it the other way around:

  1. Create a blank remote repository in a local folder that all team members have access to.
  2. Connect our local repository to it.
  3. Push the commits from our local repository to this new “remote” repository.

The only difference here is that the location of the remote is not a GitHub URL and we don’t create the remote in our web browser with the GitHub interface. In fact, once we’ve created the remote repository, we link our local repository to it with the same two commands we used in the Git remote session in Section 8.6.

This time, we create the remote repository with the Windows interface by using Git Bash in 3 steps:

  • First we create the folder where we want our remote to be located. We’ll put it at the root of our DST project directory and call it git-remote:

    Figure A.5: Create new folder for the remote repository.

  • Next, we go to the new directory and open Git Bash:

    Figure A.6: Open Git Bash here.

    Git Bash is a program for operating Git through a command-line interface in Windows. It is similar to the terminal we used in RStudio. We should see a line indicating the current directory and below it, a $ sign followed by a blinking bar, indicating that Git Bash is ready to accept new commands.

  • Now, we use a new command: git init --bare, which converts the current directory to a blank Git repository.

    Figure A.7: ‘git init --bare’ creates the necessary files to convert the folder into a blank Git repository.

These three steps accomplish the same thing as creating a remote repository through the GitHub interface, which we covered in the course session. The rest of the setup (linking the local repository to the remote repository and pushing the contents of the local repository to the remote) is identical to the course session, except the location of the remote is a folder, not a GitHub URL, as shown in Figure 8.9.

So, let’s go back to our RStudio project and open the RStudio Terminal and complete the process:

git remote add origin E:/project_directory/git-remote 
git push -u origin master

And that’s it! Now we have version control shared between all users of the DST project integrated into RStudio.

Figure A.8: The remote repository now works for pushing and pulling through the RStudio Git interface.

Alternatively, if you’re more comfortable working in the terminal, the remote repository can also be set up by just using the terminal.

# Creates the folder for the remote.
mkdir -p E:/project_directory/git-remote 

# Changes the terminal's working directory to the remote folder.
cd E:/project_directory/git-remote/ 

# Creates an empty remote repository in the remote folder.  
git init --bare 

# Changes the terminal's working directory back to the R project folder.
cd E:/project_directory/LearningR 

# Links the project to the remote repository  
git remote add origin E:/project_directory/git-remote 

# Pushes the local repository to the remote
git push -u origin master 
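
To see how a collaborator would pick up the project from the shared folder, the whole round trip can be tried out in throwaway directories. This is a sketch under assumptions rather than the exact DST setup: the temporary paths stand in for `E:/project_directory`, the `colleague-copy` folder and the `README.md` file are hypothetical, and the branch name is looked up instead of hard-coded because it may be `master` or `main` depending on the Git version and configuration.

```shell
# Sketch: shared-folder workflow in throwaway directories.
# All paths are temporary stand-ins for E:/project_directory (assumption).
base_dir="$(mktemp -d)"
remote_dir="$base_dir/git-remote"
project_dir="$base_dir/LearningR"
colleague_dir="$base_dir/colleague-copy"   # hypothetical collaborator folder

# 1. Create the blank (bare) "remote" repository.
git init --quiet --bare "$remote_dir"

# 2. Create a local project repository with one commit.
git init --quiet "$project_dir"
cd "$project_dir"
git config user.name "My Name Here"
git config user.email "mynamehere@mymailhere.com"
echo "# LearningR" > README.md
git add README.md
git commit --quiet -m "Initial commit"

# 3. Link the project to the remote and push the current branch.
#    The branch is looked up because it may be called master or main.
branch_name="$(git symbolic-ref --short HEAD)"
git remote add origin "$remote_dir"
git push --quiet -u origin "$branch_name"

# 4. A collaborator gets a working copy by cloning the shared folder.
git clone --quiet "$remote_dir" "$colleague_dir"
ls "$colleague_dir"
```

After the clone, the collaborator’s copy already has `origin` pointing at the shared folder, so pushing and pulling work the same way as with a GitHub remote.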

Are we missing out by not having access to GitHub? Not with regard to version control. Since we’re already working on a remote virtual desktop that our project collaborators can also access, the repositories here serve the same function as a remote repository like GitHub normally would:

  • A platform for collaboration, where all team members can keep their edits synchronized and work in parallel without losing track of changes.
  • A secure back-up of the entire history of the project in case something happens or goes wrong.

A.2.6 How R can help us deal with common issues in restricted environments

A.2.6.1 Keeping track of file paths

The preset folder structures are often ugly, long, and maze-like. While they can’t be edited, a workflow built on R projects and here::here(), as shown in this course, means you don’t have to deal with the long path names once you’ve moved everything into the project folder.

A.2.6.2 Using foreign data sources

Sometimes, these types of servers will be using a specific software and data file format. Perhaps the data is in a different format, or it is stored in a database, e.g. on an SQL server. R is extremely flexible, and you can open practically any data format in R using R scripts with familiar R syntax. This includes proprietary data formats such as SAS (which is the default raw data format used by DST) or ODBC-compatible SQL databases (often used for centralized storage of large amounts of data). The intermediate course material covers how to handle foreign formats and databases.

A.2.6.3 Some drawbacks to working in a virtual environment

To name a few:

  • In addition to being a repository, GitHub can also be used as a forum for communicating about the project workflow (and for getting feedback and contributions from the public if using a public repository). Not having this project management aspect can be a limitation.
  • You cannot copy-paste or conveniently import anything into the virtual environment, including formatted tables, code chunks, or scripts. Certain scripts (e.g. AutoHotkey) allow you to automatically “type out” the contents of your local clipboard onto the virtual desktop, but in practice these do not work on most virtual machine platforms due to keyboard driver issues, often creating dangerously erratic keyboard behaviors instead. Using these types of scripts is not recommended.
  • The installed versions of R, RStudio, and the packages will only be as up-to-date as the system administrators keep them. Again, if something is out of date and you need it updated, don’t be afraid to ask them to help you out. However, it can be tricky to keep the updates of all the different packages compatible with each other and with new versions of R, and keeping everything on the cutting edge of new releases is not feasible. In the case of DST, you will be running versions that are often several months behind the latest “real-world” releases, even in the most updated cases.