If you find any typos, errors, or places where the text may be improved, please let us know by providing feedback either in the feedback survey (given during class) or by using GitLab.
To maximize how much you learn and retain, you as a group will take what you learn in the course and apply it to create a reproducible project. This project, based on a simple data analysis of a dataset of your choice, will have a GitHub repository and an HTML document as a report to demonstrate the reproducibility of the analysis. The dataset cannot be the same as the one we used in class, and it must be an open dataset obtained from an online archive (we recommend a few below). Preferably, each team chooses a different dataset; the lead instructor will coordinate this.
In the last session of the course, you as a group will give a ~5-10 min presentation of your analysis and report. Before the presentation, the lead instructor will download your project from GitHub and re-generate your report to check that it is reproducible. Then, during the presentation, you as a group will:
- Show your project on GitHub.
- Show your generated HTML report.
- Describe what you did and the reasons behind what you did.
- Explain any challenges you encountered or things you would do differently.
- Explain anything you liked and would do more of in the future.
Throughout the course, you will work together as a group on a few exercises, especially the final exercises in each session that are not related to the assignment. Near the end of the course you will have dedicated time to work on your assignment and complete the tasks given below. Before you do any tasks though, decide as a group who will present the project at the end. We’d recommend assigning one or two people to be the “presenters”.
It is important to note that you will be using Git and GitHub to manage your group assignment, so you should take care to set this up correctly before doing any substantial work on the assignment. You are likely to push and pull a lot of content (which you will learn how to do later), so you will need to maintain regular and open communication with your team members.
To make things a bit easier for you to start on the project sooner, we’ve created the repositories with the appropriate file and folder structure for you. So, your specific tasks (based on lesson order) are to:
All team members need to clone your team’s repository to their own computer, using the process we learned in cloning exercise @ref(#ex-clone-from-github) in the Version Control session.
Find an open dataset to use for your analysis and report. You have two options: 1) choose one of the datasets provided below; or 2) search for another dataset from an online data archive (e.g., figshare). If you choose the second option, make sure to select a dataset that isn’t too big (maybe max. 4 MB) and that contains some basic quantitative data, so that you can use some basic functions on it. Don’t spend too much time on this task. Here are some datasets we’ve identified that would be good for the assignment:
Assign one team member to download the dataset and put it into the `data-raw/` folder of your project.
- For the GitHub-linked datasets above, you should be able to click on the “Raw” or “Download” buttons, then use the keyboard shortcut (`Ctrl-S`) to save the page.
- If there is any documentation on the dataset, e.g. variable definitions, put those files in the `data-raw/` folder as well.
- This team member should then add and commit the new file(s) in the project, so it gets saved to the Git history. Then push the changes to the team’s GitHub repository.
- Next, have all team members pull the new changes to their own computers.
Use the created `data-raw/original-data.R` script to clean up and prepare the raw dataset (as covered in Data Management). A (strongly) suggested way of working here is to assign one team member as the “coder” and, as a group, discuss and decide how to clean up the data by telling the “coder” what to do. After you’ve cleaned it up enough that you as a group are satisfied, save the new dataset with `usethis::use_data()`, which writes it to the project’s `data/` folder. The coder should then add and commit these changes before pushing to the team’s GitHub repository, so the other members can access the data too.
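As a rough illustration of what the cleaning script could contain, here is a minimal sketch. The data frame, its column names, and the cleaning steps are all made up for this example; your own script would instead read the file you placed in `data-raw/` and apply the steps your group agrees on. Note that `usethis::use_data()` expects the project to have a `DESCRIPTION` file, so the sketch only calls it when one is present.

```r
# Hypothetical sketch of data-raw/original-data.R. The data, column names,
# and cleaning steps are invented for illustration only.

# Stand-in for the raw data; in a real script this would be read from a
# file in data-raw/, for example with read.csv().
raw_data <- data.frame(
  Participant.ID = c(1, 2, 3, 4),
  Weight.kg = c(70.5, NA, 82.1, 64.0),
  Height.cm = c(175, 160, 181, 168)
)

clean_data <- function(data) {
  # Use lowercase column names with underscores instead of dots.
  names(data) <- tolower(gsub(".", "_", names(data), fixed = TRUE))
  # Drop rows that contain missing values.
  data <- data[complete.cases(data), ]
  # Add a derived variable (body mass index).
  data$bmi <- data$weight_kg / (data$height_cm / 100)^2
  # Reset the row names after dropping rows.
  rownames(data) <- NULL
  data
}

cleaned_data <- clean_data(raw_data)

# Save the cleaned dataset as data/cleaned_data.rda. This step is skipped
# if the project has no DESCRIPTION file or usethis is not installed.
if (file.exists("DESCRIPTION") && requireNamespace("usethis", quietly = TRUE)) {
  usethis::use_data(cleaned_data, overwrite = TRUE)
}
```

Keeping the cleaning steps inside one function makes it easier for the group to review what the “coder” has written and to re-run the whole script from the raw data at any time.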
Create an R Markdown file with `YOURNAME` in its file name (replacing `YOURNAME` with your actual name) in the `doc/` folder of your project (covered in Reproducible Documents). This means that each team member will have their own R Markdown document to work in without conflicting with the other team members. Note that if several members have the same name, you need to make sure your `Rmd` documents have unique names (for instance, by using your full name). Distribute the tasks (listed below) so each team member is (mostly) doing something different, like doing some simple analyses of the dataset, making figures (as covered in Data Visualization), and so on. Keep using the “Git workflow”: add your changes to the staging area, commit, push, and pull. If you end up with merge conflicts, ask the instructors or helpers for support in resolving them.
- Create section headers (e.g. “Introduction”, “Methods”, “Results”, “Discussion”).
- In a “Methods” section, write up a basic description of what the dataset is, where you got it from, and what you did to process or analyze the data.
- Create at least one table in the report that is generated from the data in a “Results” section.
- Create several different figures from your data in the “Results” section, each generated by code in the document.
- Try to “knit” your document to HTML often, to make sure the analysis is reproducible.
- Note: For now, do not add and commit the HTML output file.
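To give an idea of the kind of code a “Results” section could contain, here is a minimal sketch using only base R. The `airquality` dataset ships with R and is used purely as a stand-in; the variable names and the choice of summary are assumptions for illustration, and your own cleaned dataset would take its place.

```r
# Stand-in data: `airquality` comes with R and is used only for
# illustration. Replace it with your own cleaned dataset.
results_data <- airquality

# Table: mean ozone per month. The formula interface of aggregate()
# drops rows with missing values by default.
ozone_by_month <- aggregate(Ozone ~ Month, data = results_data, FUN = mean)
ozone_by_month$Ozone <- round(ozone_by_month$Ozone, 1)
ozone_by_month

# Figure: distribution of ozone within each month.
boxplot(
  Ozone ~ Month,
  data = results_data,
  xlab = "Month",
  ylab = "Ozone (ppb)",
  main = "Ozone by month"
)
```

Inside an R Markdown code chunk, passing the summary to `knitr::kable()` (for example `knitr::kable(ozone_by_month)`) produces a formatted table in the knitted HTML, and the figure is embedded automatically.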
Once you as a group are happy with the results, each team member should write in the “Discussion” section of their report file a few sentences on some things you liked about doing the project with the tools you learned and a few sentences on some challenges you experienced. Add any thoughts that you as a group came up with as well.
Assign one team member as the “report coordinator” and have them merge all the individual reports into a single `report.Rmd` file. They do this by copying and pasting the content of each report into the appropriate sections of `report.Rmd`. After they are done, they add and commit `report.Rmd`. Lastly, they delete all the report files except `report.Rmd` and add and commit these deletions to the Git history. `report.Rmd` should then be the only report file left in the repository.
- As the report coordinator merges the report file, assign another team member to update the `README.md` file with some details about the project and about who was on the team. Save, add, commit, and push these changes to the GitHub repository.
Generate an HTML version of the report and commit it to Git, then push it to GitHub. Make sure all the updated code and files are on GitHub for the presentation.
These tasks may seem like a lot, with a lot of new terminology and tools to use. But don’t worry! We will be going over many of these topics and you will have time to complete the project over the three days. Use the website as a reminder and our cheatsheet as a reference.
At the end, the lead instructor will download each team’s Git project, knit the R Markdown documents to test their reproducibility, and show them on the screen for each team to present.
- Project used Git and is on GitHub.
- Included a good README describing the project and the team.
- Separated “raw data” from “cleaned data”.
- Used scripts to clean the data.
- Included R code within an R Markdown file to show results and make figures.
- Wrote about methods and datasets.
- Went over challenges and general experiences.
- Generated an HTML file from an R Markdown file.
What we expect you to do for the group project:
- Use Git and GitHub throughout your work.
- Work collaboratively as a group and share responsibilities and tasks.
- Use as much as possible of what we covered in the course, to practice what you learned.
What we don’t expect:
- Complicated analysis or coding. The simpler it is, the easier it is for you to do the coding and understand what is going on. It also helps us to see that you’ve practiced what you’ve learned.
- Clever or overly concise code. Clearly written and readable code is always better than clever or concise code. Keep it simple and understandable!
Essentially, the group project is a way to reinforce what you learned during the course, but in a more relaxed and collaborative setting.