class: center, middle, inverse, title-slide

# Finding and obtaining open datasets

---
layout: true

---

## Where is Open Data in the Open Science Universe?

- Open data is only a small part of the open science movement

.center[
<img src="../images/OpenUniverse.png" width="80%" height="80%" />
]

.footnote[Image source from Foster Open Science [(www.fosteropenscience.eu/resources)](https://www.fosteropenscience.eu/resources).]

---

## Open Data Principles

.pull-left[
- Open data is not only about datasets you can download from the internet
- Accessibility is only one part of Openness
]

.footnote[
Go FAIR Initiative (https://www.go-fair.org/fair-principles/)
]

---

## Open Data Principles

.pull-left[
- Open data is not only about datasets you can download from the internet
- Accessibility is only one part of Openness
- Open data should be FAIR:
    - Findable
    - Accessible
    - Interoperable
    - Reusable
]

.pull-right[
<img src="../images/gofair.png" width="80%" height="80%" />
]

.footnote[
Go FAIR Initiative (https://www.go-fair.org/fair-principles/)
]

---

## FAIR data

- **Findable**: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers.
- **Accessible**: Once users find the required data, they need to know **how** the data can be accessed, possibly including authentication and authorisation.
- **Interoperable**: The data usually need to interoperate with applications or workflows for analysis, storage, and processing. (Meta)data should use a formal, accessible, shared language or format.
- **Reusable**: Data and metadata should be well-described so that they can be replicated and/or combined in different settings.

.footnote[  Go FAIR Initiative (https://www.go-fair.org/fair-principles/) ] --- ## Finding Open data - Starting points: - Data Resources known in your network - Publication with link to data source / repository - Search in public repositories --- ## Known data resources - Wide range of options between fully closed and fully open (FAIR) - Closed --- ## Known data resources - Wide range of options between fully closed and fully open (FAIR) - Closed .pull-right[ <img src="../images/CPRD.png" width="90%" height="90%" /> ] - Commercial / Paid access - e.g. [(CPRD)](https://www.cprd.com/) --- ## Known data resources - Wide range of options between fully closed and fully open (FAIR) - Closed - Commercial / Paid access - e.g. [(CPRD)](https://www.cprd.com/) - Data sharing within a project/collaboration (restricted) --- ## Known data resources - Wide range of options between fully closed and fully open (FAIR) - Closed .pull-right[ <img src="../images/whitehall.png" width="90%" height="90%" /> ] - Commercial / Paid access - e.g. [(CPRD)](https://www.cprd.com/) - Data sharing within a project/collaboration (restricted) - Gated Data Sharing (Application + Evaluation of proposal, processing fee) - e.g. Whitehall II Study [(Whitehall II Study)](https://www.ucl.ac.uk/epidemiology-health-care/research/epidemiology-and-public-health/research/whitehall-ii/data-sharing) --- ## Known data resources - Wide range of options between fully closed and fully open (FAIR) - Closed .pull-right[ <img src="../images/ELSA.png" width="90%" height="90%" /> ] - Commercial / Paid access - e.g. [(CPRD)](https://www.cprd.com/) - Data sharing within a project/collaboration (restricted) - Gated Data Sharing (Application + Evaluation of proposal, processing fee) - e.g. Whitehall II Study [(Whitehall II Study)](https://www.ucl.ac.uk/epidemiology-health-care/research/epidemiology-and-public-health/research/whitehall-ii/data-sharing) - Only registration required - e.g. 
English Longitudinal Study of Ageing [(ELSA)](https://www.elsa-project.ac.uk/data-and-documentation) accessible via the [(UK Data Service)](https://ukdataservice.ac.uk/) --- ## Known data resources - Wide range of options between fully closed and fully open (FAIR) - Closed .pull-right[ <img src="../images/NHANES.png" width="90%" height="90%" /> ] - Commercial / Paid access - e.g. [(CPRD)](https://www.cprd.com/) - Data sharing within a project/collaboration (restricted) - Gated Data Sharing (Application + Evaluation of proposal, processing fee) - e.g. Whitehall II Study [(Whitehall II Study)](https://www.ucl.ac.uk/epidemiology-health-care/research/epidemiology-and-public-health/research/whitehall-ii/data-sharing) - Only registration required - e.g. English Longitudinal Study of Ageing [(ELSA)](https://www.elsa-project.ac.uk/data-and-documentation) accessible via the [(UK Data Service)](https://ukdataservice.ac.uk/) - No registration required - e.g. [(NHANES)](https://wwwn.cdc.gov/nchs/nhanes/) --- ## Finding Open data - Starting points: - Data Resource known in your network - Publication with link to data source / repository - Search in public repositories --- ## Publications with links to data Journals increasingly encourage publication of (links to) data Let's have a look at the PLOS journals: - Policy requiring researchers to share the data underlying their results or to state why this is not possible - But: Are authors complying with these requirements? [(PLOS Medicine: Diabetes)](https://journals.plos.org/plosmedicine/search?filterJournals=PLoSMedicine&filterSubjects=Medicine+and%20health%20sciences&filterArticleTypes=Research%20Article&q=diabetes&page=1) --- ## Publications with links to data Journals increasingly encourage publication of (links to) data Let's have a look at the PLOS journals: - Policy requiring researchers to share the data underlying their results or to state why this is not possible - But: Are authors complying with these requirements? 
[(Review by Federer et al)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0194768)

.center[
<img src="../images/FedererPLOSONE.png" width="85%" height="85%" />
]

---

## Review of PLOS data statements (Federer 2018)

.center[
![](https://journals.plos.org/plosone/article/figure/image?id=10.1371/journal.pone.0194768.t001&size=medium)
]

---

## Finding Open data

- Starting points:
    - Data Resource known in your network
    - Publication with link to data source / repository
    - Search in public repositories

---

## Figshare

[Figshare: Diabetes](https://figshare.com/search?q=diabetes&sortBy=posted_date&sortType=desc&licenses=1,2,47,52,3,40,49&itemTypes=3&categories=7,48)

.center[
<img src="../images/Figshare.png" width="90%" height="90%" />
]

---

## Dryad

[Dryad: Diabetes](https://datadryad.org/search?utf8=%E2%9C%93&q=diabetes)

.center[
<img src="../images/Dryad.png" width="90%" height="90%" />
]

---

## GitHub Collection

[GitHub: Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)

.center[
<img src="../images/Github.png" width="90%" height="90%" />
]

---

## Rigsarkivet (Overview and access info)

[Rigsarkivet: Sundhed](https://www.sa.dk/da/forskning-rigsarkivet/rigsarkivet-sundhed/)

.center[
<img src="../images/Rigsarkivet.png" width="90%" height="90%" />
]

---

## Open Neuro (MRI and fMRI images)

[OpenNeuro: Diabetes](https://openneuro.org/search/diabetes)

.center[
<img src="../images/OpenNeuro.png" width="90%" height="90%" />
]

---

## How are you allowed to use data you find?

- First important step: know who 'owns' the data and what they allow you to do with it - Public Domain: There is no owner, you are allowed to use the data in any way - Data are protected by copyright: but the owner gives you a license to use it in a certain way - Different Open Licenses .pull-left[ <img src="../images/CC_licenses.png" width="70%" height="70%" /> ] .pull-right[ <img src="../images/CClicense_range.png" width="30%" height="30%" /> ] --- ## Different Open Licenses .pull-left[ For any type of 'work', including databases: - Creative Commons: - CC BY-NC (Attribution-NonCommercial) - CC BY-ND (Attribution-NoDerivatives) - CC BY-SA (Attribution-ShareAlike) - CC-BY (only Attribution required) - CC0 (= placing something in the public domain) - Open Data Commons Licenses: - ODC-By (Attribution required) - PDDL (= placing a database in the public domain) ] .pull-right[ Mostly for Open Source Software: - GNU - MIT ] Finding a suitable license for your data: [(Choose a license)](https://choosealicense.com/) or [(Creative Commons Chooser)](https://chooser-beta.creativecommons.org/) --- ## Summary - Open Data is only a part of the Open Science Universe - Open Data should be FAIR (but are often only in part) - There are many different ways of finding Open Data, none are ideal (yet) - Be mindful of the licence attached to a dataset --- ## Links and references .pull-left[ **General Resources** - [Go Fair] (International initiative to promote FAIR data) - [Foster Open Science] (EU project with general resources on Open Science) - [Open Science Framework] (Resources for Open Science) - [Center for Open Science] (Resources for Open Science) ] .pull-right[ **Data Repositories** - [Dryad] (Mostly Manuscript-linked) - [Figshare] (Mostly Manuscript-linked) - [European Data Portal] (Mostly high aggregation level) - [NIH data repositories] (Links to topic specific repositories) - [YODA project] (Request access to RCT data) - [Project Datasphere] (Cancer Research databases) - 
[Nature recommended data repositories]
- [ClinicalStudyDataRequest] (Request access to RCT data)
]

[Open Science Framework]: https://osf.io/
[European Data Portal]: https://www.europeandataportal.eu/
[GitHub]: https://github.com/
[Dryad]: https://datadryad.org/
[Figshare]: https://figshare.com/
[Center for Open Science]: https://cos.io/
[Choose a License]: https://choosealicense.com/
[Creative Commons]: https://creativecommons.org/
[Go Fair]: https://www.go-fair.org/fair-principles/
[EU Turning FAIR into reality]: https://ec.europa.eu/info/sites/info/files/turning_fair_into_reality_0.pdf
[Plan S]: https://www.coalition-s.org/
[Foster Open Science]: https://www.fosteropenscience.eu/
[NIH data repositories]: https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
[UK Data Archive]: https://data-archive.ac.uk/find/archive-catalogue
[YODA project]: https://yoda.yale.edu/
[Project Datasphere]: https://www.projectdatasphere.org/projectdatasphere/html/home
[ClinicalStudyDataRequest]: https://www.clinicalstudydatarequest.com/
[Nature recommended data repositories]: https://www.nature.com/sdata/policies/repositories

???

Finding and Obtaining Open Datasets - Lecture Notes

In this section, we will look at some practical ways to find and access open data. This introduction is by no means exhaustive but is designed to help you find datasets that you can use to: 1) practice your data wrangling and analysis skills; 2) check the reproducibility of published papers; and 3) conduct original research. Along the way, we will see how researchers and institutions are currently publishing datasets and analytical code. We will also see that the situation is not as good as it could be.

To start with, it is worth pointing out that accessible data is only one part of a wider set of principles guiding open and reproducible science.
As you can see on Slide 2, Open Science also encompasses open access publication, open reproducible research, open science tools (such as R), and higher-level principles such as open science evaluation and policies. We will touch on several of these branches in this course, but at this point, the focus is on open and accessible data.

When people think of open data, their first thought is often whether the data can be downloaded from the internet. However, this is only one part (i.e., accessibility) of the four key components of open data. For data to be truly open, it needs to be FAIR: Findable, Accessible, Interoperable and Reusable. Let’s explore these components.

**Findability:** Data can only be findable if it is indexed somewhere by a search engine or repository. In an ideal situation, somebody who searches for any feature of the data should find it. This requires each data element (e.g., a variable or whole dataset) to have a globally unique and persistent identifier (e.g., a UUID or DOI), which is linked to contextual and searchable metadata (i.e., data about data). Contextual metadata can exist at different levels, ranging from individual variables to the whole dataset. The contextual part refers to the circumstances or context in which each variable and the whole dataset were created. At the level of a dataset, relevant metadata would describe how the dataset was generated (i.e., study design, source population, invitation process, response rates, and informed consent process). At the level of a variable, contextual metadata could describe what the variable represents, how the measurement was obtained or derived (e.g., equipment used and quality control features), and which data cleaning, processing, and interpretation steps have already been taken (including when, by whom, and with which aim/perspective in mind).
In effect, the metadata should ideally provide all the information you need to decide whether the variable is suitable for your own research purpose, as well as all the necessary background to use it comfortably. Some of these points also relate to reusability, which we will come to a bit later.

There is an added layer to the expectations we should have for findability and the contextual metadata needed for this. Datasets should be findable not only by humans, but also by machines. This means that the metadata should preferably be available not only in prose or natural language, but also in machine-readable formats (standardised codes and ontologies).

**Accessibility**: Accessibility relates to the steps you need to go through in order to actually get a copy of the data. Can you just download it from an online repository, or do you need to apply/pay for access? We will take a closer look at some of the different access models a bit later on.

**Interoperability**: This means that the data and metadata should be in a format that can be used by different researchers with different tools. For example, the data should not be in a format that forces you to use, and hold a licence for, software owned by a single company. A very simple example of an interoperable way of formatting data is a CSV (comma-separated values) file, which stores the raw data as plain text in the simplest way possible. When datasets become more complex (e.g., when there are multiple tables that are related to each other), an open format can be used where the description and structure of the database are included in a separate file (e.g., an XML schema) that tells software how the database is structured. There are some formats that have an intermediate form in terms of interoperability, such as SAS, SPSS, and Stata database formats.
These formats are not directly interoperable since they are specific for the respective statistical packages, but their structure (schema) is known and can be used to transfer the data from one format to the other. Generally, these formats are remnants of dated closed approaches that have gradually been opened. Unfortunately, there is still lots of data in fully proprietary formats, which means that a company, university, or another organisation owns the only software that can be used to access data in their format. They do this by keeping the data schema secret and the data encrypted in such a way that it can only be accessed by using the software they provide. Many companies used to do this, including all the usual statistical packages and some word processors/spreadsheets, but most of these formats have been reverse-engineered and been made public in recent decades. However, we still encounter a lot of closed and non-interoperable data formats, mostly in software tied directly to measurement hardware. For example, many of the commercial accelerometers targeted at consumers will store the raw data they collect in a proprietary format, and only make a highly processed subset or summary of the data accessible to users. This means that these devices are of limited use for investigators who are interested in using all the collected data. Closed formats are also often seen in software linked to imaging and other measurement hardware. However, it is not only companies that choose to close their formats and thereby make their data schemas non-interoperable. In the diabetes field, the clearest example is the HOMA-2 model, which has been developed by Oxford University but has never been published. The university makes an online tool available that you can freely use to calculate HOMA-2, but the algorithm behind it is proprietary, which means that the only way to obtain HOMA-2 values is by using their website. 
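
Returning to the CSV example from the interoperability discussion, the idea of pairing an open data format with machine-readable metadata can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names, variable names, values, and the layout of the JSON data dictionary are all made up for this example and do not come from any real study or standard.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical example records; not real study data.
rows = [
    {"id": "P001", "age": 54, "hba1c": 41.0},
    {"id": "P002", "age": 61, "hba1c": 52.5},
]

# A simple, made-up data dictionary: machine-readable metadata per variable.
dictionary = {
    "id": {"label": "Pseudonymised participant ID", "type": "string"},
    "age": {"label": "Age at examination", "type": "integer", "unit": "years"},
    "hba1c": {"label": "Glycated haemoglobin", "type": "number", "unit": "mmol/mol"},
}


def write_dataset(folder, rows, dictionary):
    """Write the data as CSV and the metadata as JSON next to it."""
    folder = Path(folder)
    data_path = folder / "data.csv"
    dict_path = folder / "data_dictionary.json"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(dictionary))
        writer.writeheader()
        writer.writerows(rows)
    dict_path.write_text(json.dumps(dictionary, indent=2), encoding="utf-8")
    return data_path, dict_path


def read_dataset(data_path, dict_path):
    """Read the CSV back, using the data dictionary to restore column types."""
    dictionary = json.loads(Path(dict_path).read_text(encoding="utf-8"))
    casts = {"string": str, "integer": int, "number": float}
    with Path(data_path).open(newline="", encoding="utf-8") as f:
        records = [
            {name: casts[dictionary[name]["type"]](value) for name, value in row.items()}
            for row in csv.DictReader(f)
        ]
    return records, dictionary


with tempfile.TemporaryDirectory() as tmp:
    data_path, dict_path = write_dataset(tmp, rows, dictionary)
    records, meta = read_dataset(data_path, dict_path)

print(records[0])
print(meta["hba1c"]["unit"])
```

Because both files are plain text, the same pair could be read just as easily from R, Stata, or a spreadsheet, which is the point of choosing an interoperable format. The data dictionary also gives a machine something to index, which ties back to findability.
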
**Reusability**: This refers to the question of whether you have all the information you need to assess whether you want, can, and are allowed to (re)use data for a different purpose. This is somewhat related to findability in the sense that rich and descriptive metadata helps you know what the dataset contains, but also with information about the licences, permissions, and procedures that govern the dataset. This information should allow you to assess whether the data owner allows the dataset to be used for the purposes you have in mind, under which conditions, and which commitments you make to acknowledgement of the data source. You may have noticed that the FAIR structure talks about principles and that it does not impose or suggest any specific implementation or technology. FAIR principles don’t dictate that data should be in the public domain. Most FAIR data do have a data owner and will be governed by a licence. We will see how this works a bit later on. Let’s turn to the practical aspects, starting with the first part: Finding data. On the slide, you will see some of the starting points for finding data that we often encounter. The easiest starting point is quite close to home; in your own research network. You may know of datasets that other researchers have collected. For example, you may hear an interesting talk at a conference and think, “I could do something interesting with that dataset”. So, after the presentation, you talk to the researcher and ask about gaining access to their data. In another example, perhaps your supervisor tells you, “There's this cool dataset out there. Email X and they will give you access.” This is very much the traditional way of finding data, and even though it is very focused, it is not scalable. A second approach starts with a publication you may come across which uses data that you find interesting, and where the dataset is available through a public repository. 
We will look into this option a bit more, as it is becoming increasingly common. There are some journals (e.g., eLife) that require that data are made available where possible. The PLoS journals require that all authors indicate how the data underlying each publication can be accessed. This could still be through direct contact with the authors, but many authors upload the datasets to repositories and provide a link in the paper. One search strategy could be to start with publications that you find interesting, check the data access statement, and access the data through the linked repositories. This approach already sounds better than going through networks, but we will see that there are still some disadvantages and many limitations. The third approach would be to search directly through the repositories for keywords linked to the topics that you are interested in. This also sounds like an attractive approach, but unfortunately, there are lists of repositories ranging from very general repositories to repositories focused exclusively on a very specific field. There are even search engines to find good lists of places to find open data. It means that you may have to search repeatedly from different starting points to identify a suitable open and accessible dataset. When you find a dataset that is suitable, the next practical question is: “What do I need to do to get access?” There is a wide range of options from closed to fully open (FAIR). Closed data is the traditional status. It basically means that you don’t have access unless you’re part of the research group or company that owns and holds the data. We won’t talk about this option too much here. Then, there are databases that you can buy access to via “commercial access”. An example is the CPRD, the Clinical Practice Research Datalink. 
This is a database that collects de-identified patient data from a network of general practices in the UK and is managed by the UK Medicines & Healthcare Products Regulatory Agency. Researchers can buy access to the database on a commercial basis. If we go to their webpage and click through to the page that outlines their pricing structure, you can see that it costs £330,000 to get a multiple access licence to run multiple projects on this database. So, that price alone puts it beyond the scope of most PhD projects, and even beyond the scope of most research institutes. It's mostly pharmaceutical companies that access databases like this. The next approach is data sharing within a collaboration or within a project. As we saw previously, this is the traditional way of doing things (e.g., your supervisor telling you to contact a certain research group). In this format, you and your collaborators all know each other and have either informal (usually oral) or semi-formal (written) agreements on how you will access and work with the data. This model of data sharing is still restricted because you must be part of the collaboration in order to access the data. Even though this is still very common practice, it is a model we would like to start moving away from. The next model is called “gated data sharing”. In this model, the database is accessible to people outside of the research group that collected the data, but the research group must approve your application to access the data. A dataset that I have worked with for several years at UCL in London, the “Whitehall II Cohort”, uses this model. They have a very clearly defined data sharing policy, which is published on their website. As you can see, they indicate that access is for bonafide (i.e., honest) research questions. 
This means that you need to write a data application that designates what your aim is and which variables you wish to use, as well as some background of your research experience, prior publications, and the institution you are affiliated with. This application is then discussed within a committee of professors and researchers from the Whitehall Study team, and they decide whether it fulfils their criteria for data sharing. This is the “gated” part of the model, where the study’s researchers are the gatekeepers deciding who can have access. In this format, there is no requirement for you to collaborate with the original researchers, although many still do as they have extensive experience in the structure and peculiarities of the data. In the case of the Whitehall II Cohort, there is an administrative fee, but it is much lower than the CPRD (i.e., £500) and only meant to cover the costs of processing applications and making the specific data selection for your project. There are examples of datasets that only require registration (i.e., who you are and what you want to do with the data) and there is no committee to judge whether your application is good enough. You must also sign some level of data sharing agreement where you commit to using and storing the data safely and securely, and do not try to contact study participants. An example of this model is the ELSA Study, another UK cohort study, that has archived its data on the UK Data Service (i.e., a national data archive for the UK including census data). On the website, you can simply make an account, submit a short description of the project, and download the entire dataset. You don't need to select variables. It takes about 7 to 10 days to get through the process of approval and get access to the complete dataset. This solution is a lot better than the previous solutions as it does not depend on committee approval, but in a lesser way, it is still gated. 
A similar procedure applies to obtaining data from one of the largest data sources, the UK Biobank. This is a dataset with extensive measurements on more than half a million participants, including full GWAS data and quite detailed phenotypes. In order to obtain UK Biobank data, you must apply with a project description, pay a processing fee, and commit to a data transfer agreement. However, the evaluation process does not consider the academic merits of your application or whether a different group may already be looking at this question.

Finally, there is a totally open model in which you can just download micro-data (i.e., individual level data in the case of human cohorts/surveys) from an online repository without a need to register or sign agreements. This model is much more typical outside of the medical sciences (e.g., astronomy, geology, biology, economics, and physics), where there are no issues around GDPR or privacy of individual participants. However, it is increasingly possible to find datasets in the medical sciences and epidemiology that can simply be downloaded in a totally open model. An example is the dataset you will be using today, the NHANES dataset. The version you will use is a cleaned-up version made specifically for teaching purposes and incorporated into the NHANES package in R. You can visit the NHANES website at the Centers for Disease Control and Prevention (CDC) and see a complete list of all the questionnaires and different data waves included in this dataset, and you can download the parts you are interested in using here and now. In practice, you will find that downloading the data is only the first (and perhaps easiest) step. Before you can really start working with the data, you will need to find and read a lot of documentation in order to understand exactly how the sample was created, what the different variable names mean, and which peculiarities are important to keep in mind when analysing the data.
This means that even if the NHANES data is Findable, Accessible, and Interoperable, its Reusability is still dependent on a lot of preparatory work. You may be wondering how it is possible to make individual level data openly accessible like this, while there are so many rules around data and privacy protection. In Europe, any organisation that collects, stores, processes, or makes data available on individuals needs to abide by the GDPR and its national implementations. In other countries, similar laws exist, albeit generally not as strict as the GDPR. First, the datasets you can download (with any type of access model) are pseudonymised. This means that any direct individual identifiers such as name, ID number (e.g., CPR), address, and date of birth have been removed from the dataset and replaced by a random ID number that does not match to other datasets. Although this makes identification of individual records in other databases more difficult, it does not make it impossible. It has been shown that if you have enough data fields in your data and you have access to a master database that includes these same variables, it is relatively easy to match and identify individuals based on propensity score matching. In order to make this more difficult, you would have to use further obfuscation techniques, such as rounding numbers to integers/one decimal so that the resulting groups become larger, or by adding some random noise. A second point to consider, and which makes the process for accessing individual level datasets very different than for accessing, for example, data from registries, is the presence of informed consent. If, at the time of data collection, the participants have been told that their participation is for a research dataset that will be in the public domain/shared with other researchers in an openly accessible way, and they have consented to this, it would be acceptable for the data to be made available. 
Of course, this will be highly dependent on how the original consent form was formulated. For example: in the Whitehall II Study, the informed consent forms signed in the earlier phases did mention use of the data for scientific purposes by researchers outside of the local team but explicitly stated that it wouldn’t be used for commercial purposes. If the dataset would be put online in a freely downloadable form, there would be no way to guarantee that it would not be used for commercial purposes; hence, the need for a committee that judges the academic merit of each project that requests data access. As such, it is important to realise that the exact formulation of the consent people provide can have very long-term consequences. We must think about this as we set up new studies. It also means that, if you are considering making an individual level participant dataset available, you should carefully check informed consent, other permissions, and local guidance/rules before doing so. Next, let's see what happens when we try to find open data. Let’s try the second approach to access and take a closer look at publications that provide a link to their data. The PLoS journals were one of the first to implement an explicit policy on data availability a few years ago. This policy states that every publication should include a link to the data that is behind the results being presented in that publication. At the very least, authors need to include a data statement either: 1) giving details of where the data can be found and which procedures apply to accessing it; 2) stating that all the data is already in the manuscript; or 3) stating the reasons why data access cannot be provided. Let’s visit the PLOS website and search for ‘diabetes’ to see what we find. Looking at a few of the most recent papers: What do they say in the data availability statement? You will most likely see a combination of no data, gated data access, and fully open access. 
In 2018, a paper was published that analysed the content of the data availability statements of the PLOS journals. The authors went through all the PLOS journals, extracted the data availability statements, and classified them. For example, 7.4% of the statements cite ethical and legal restrictions. This effectively means that there is no access to the data. In 24% of cases, the data statements say that all the reported data are in the paper, and in 45% of cases, that the data are in the paper or the supplemental information. I guess that this could be correct for small datasets, such as animal studies or studies of cell lines, where all the data could be in an Excel spreadsheet that is attached as supplemental material. But I suspect that, in many cases, the tables and attachments only contain a high-level summary of the data, and not all the raw data you would need to replicate the analyses. Finally, only 15.4% of papers say that the data are available at a publicly accessible location. It is important to remember that these results apply to one of the most progressive groups of journals with respect to open data. In many other publications, the proportion of manuscripts that link to openly accessible data sources is likely to be much smaller.

Let’s move on to the third approach: searching in repositories for data that we think may be interesting. There are a lot of different repositories; some are very focused on one specific area and very dedicated to high quality control of the data hosted there, and other repositories are very broad with varied quality. Two of the major repositories that I would encourage you to look at are Dryad and Figshare, as they are where most journal articles that link to data upload their datasets. Let’s look at Figshare and search for diabetes-related results. As you will probably see, this search brings up a lot of datasets, but it is often not immediately clear what each dataset includes.
You would have to go through them one by one to know whether they are relevant and contain the data you are looking for, if you had a pre-defined research question in mind.

There is another disadvantage to this way of linking datasets to publications. Let’s imagine that you have obtained access to an open database (e.g., the NHANES data). You download the data from the internet and complete your analyses. When you want to publish the paper, the journal requests that you include the data your analyses are based on. So, what should you do? Should you just link to the original source of the NHANES data, and say that others can obtain it there? Or should you take the data through all the cleaning, processing, and analysis steps that you have conducted, and upload a partial copy containing only the selected observations and variables actually used to generate the tables in your paper? This decision depends on whether you view the data access principles mostly with regard to replication within the confines of this specific analysis, or whether you view access to data as part of a far wider set of aims that should enable other researchers to use open data for purposes other than just checking the specific details of your analyses. If your aim is only replication of this specific analysis, it would probably be easiest for both the readers and the researchers to publish the selected and processed dataset used for your specific paper. However, this very much limits the usefulness of the shared dataset. If you wanted to enhance the wide re-use of the data, linking to the original source would be preferable, but the link would need to be accompanied by publication of the full syntax that is needed to download, process, and analyse the data. If that code is not made available, it would not be possible to use the linked data to check the analyses made in that specific paper. Unfortunately, you will often see that all that is published or linked to is a cut-down and processed version that is specific to only one paper.
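
As a rough sketch of what such published syntax might look like, here is a short Python script that takes a raw file through every step to the analysis dataset. Everything in it is an assumption made for illustration: the column names, the age cut-off, and the toy values are invented and are not real NHANES variables, and a real script would start by downloading the file from the original source.

```python
import csv
import io

# Toy stand-in for a file downloaded from the original source.
# Column names and values are invented for illustration only.
RAW_CSV = """id,age,height_cm,weight_kg
1,34,170,70
2,17,160,55
3,52,180,95
4,45,,80
"""


def load_raw(text):
    """Read the raw file exactly as obtained from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def select_adults(rows, min_age=18):
    """Inclusion criterion: keep participants aged min_age or older."""
    return [r for r in rows if int(r["age"]) >= min_age]


def drop_incomplete(rows, required=("height_cm", "weight_kg")):
    """Exclude participants with a missing value in any required variable."""
    return [r for r in rows if all(r[c] != "" for c in required)]


def add_bmi(rows):
    """Derive BMI = weight (kg) / height (m) squared, rounded to 1 decimal."""
    out = []
    for r in rows:
        height_m = float(r["height_cm"]) / 100
        bmi = float(r["weight_kg"]) / height_m ** 2
        out.append({**r, "bmi": round(bmi, 1)})
    return out


def build_analysis_dataset(text):
    """Run every step from raw file to analysis dataset, in order."""
    return add_bmi(drop_incomplete(select_adults(load_raw(text))))


analysis = build_analysis_dataset(RAW_CSV)
for row in analysis:
    print(row["id"], row["bmi"])
```

Because each selection and derivation is a named step, a reader who obtains the same raw file from the original source can re-run the script and see exactly which participants were excluded and how each derived variable was calculated.
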
As mentioned, such a cut-down version is only useful for narrow-sense replication, and not for reusability. Another problem is that the metadata is often missing. As we saw before, metadata refers to all the information that tells you what each variable means and how it was obtained/processed. A professionally managed cohort should have all this documentation, but unfortunately, you will often see that the selected versions of datasets published alongside papers do not include a reference to all the metadata, and that you then have to search for all of that separately. This is another reason why it is preferable to link to the original data source, making sure that you publish all the code that is needed to process it.

Dryad is similar to FigShare, although it can be a little bit better in terms of providing a good workflow for uploading a dataset, which encourages you to provide the relevant metadata and full data documentation. It also allows you to obtain a Digital Object Identifier (DOI) for a published dataset. A DOI is effectively a reference that other people can use to refer to your published dataset. This encourages a development that we will hopefully see a lot more of in the future: researchers publishing well-documented data products, not just as background data to a paper but as independent research outputs.

Then, there are collections. For instance, somebody has made this GitHub repository with a list of awesome public datasets. There is a massive amount of data on GitHub, but again, the problem is that it is very difficult to find exactly what you are looking for.

Specific to Denmark, there is the Danish Data Archive. This is a part of the Danish National Archives and is a government service that stores data of all types that people believe is important to conserve long-term. This sometimes includes individual-level data from studies that have been concluded.
As a researcher, you can register your study database with the Danish Data Archive along with any relevant metadata and specify whether the data should be made immediately accessible or locked for a certain number of years. From that point on, the archive will ensure that the data is kept and made accessible under the conditions that the data owner specified. So, if you are looking for Danish data, you can search the Data Archive directly and sometimes find interesting or relevant data along with a description.

An example of a repository for a very specific and narrow research field is one for MRI and functional MRI data. It documents several data sources ranging from small to very large, and the datasets are generally very well documented. The datasets are very large, so if you are working in the field, this repository would give you access to terabytes of research data you can use.

Now, let’s assume that you have found a dataset that you want to use. There is one further important step you must take before you can get started: checking under which licence the dataset has been made available. The fact that data is open, accessible, and downloadable does not always mean that you can use it for any purpose. The first important distinction is whether the data has a data owner. If data (or an image, music, text, or software) has no owner, it means that it is in the public domain. This means that nobody has copyright on the data and that you can use it for any purpose. This is the case if the data owner has explicitly placed the dataset in the public domain by relinquishing all their rights over it, or if it is in the public domain by law, either because a certain number of years have passed since the creator’s death or the date of creation, or, in some countries, because it was created by a public official. Rules related to copyright vary in different countries but are harmonised within the EU.
If the data (or any other ‘work’) is not explicitly in the public domain, it means it has an owner who can decide under which conditions to allow access/use. This is described in a licence. A licence is like a contract between the data owner and the data user, specifying what the data user is allowed to do and under which conditions. Licences can be bespoke (i.e., specifically written by the data owner for that dataset) or general. In order to make data sharing and open data use easier, several organisations have created general licences that a data owner can choose in order to allow users to use their data without placing it in the public domain.

A good example is Creative Commons (CC). They have created several licences with different levels of openness that data owners can choose when publishing their data (or works) if they want people to use it without payment. There are four key components to CC licences: attribution, non-commercial use, no derivatives, and share alike. Attribution (BY) means that you need to reference the source and the data owner if you use the data (or the work). Non-commercial use (NC) means that you are only allowed to use the data (or the work) for non-commercial purposes (i.e., you are not allowed to charge for whatever you produce based on that data). No derivatives (ND) means that you are not allowed to create derivative works, in that you cannot alter/modify or wrangle (e.g., merge, select, or subset) the work, and have to use it exactly in the way it was provided. Finally, there is share alike (SA), which means that you are allowed to create derivative works, but if you share them, you have to share them under the same type of licence as the one that applied to the original material. These components can be applied in several combinations. The most restrictive combination is CC BY-NC-ND. If you join up all the conditions, you will see that any dataset made available under this licence would be virtually unusable for open and reproducible research.
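
To make the four components concrete, here is a small Python sketch that splits a CC licence label into its components and lists the conditions that follow from them. The helper names (`parse_cc`, `conditions`, `allows_derivatives`) are my own and purely hypothetical, and the sketch only handles plain labels built from the four components, without version numbers.

```python
# Meaning of each Creative Commons component, as described above.
COMPONENTS = {
    "BY": "attribution required",
    "NC": "non-commercial use only",
    "ND": "no derivative works",
    "SA": "derivatives must be shared under the same licence",
}


def parse_cc(licence):
    """Split a label such as 'CC BY-NC-ND' into its component codes."""
    label = licence.upper().strip()
    if label.startswith("CC-"):
        # Accept the hyphenated spelling, e.g. 'CC-BY'.
        label = "CC " + label[3:]
    if not label.startswith("CC "):
        raise ValueError(f"not a recognised CC label: {licence!r}")
    codes = label[3:].split("-")
    unknown = [c for c in codes if c not in COMPONENTS]
    if unknown:
        raise ValueError(f"unknown component(s): {unknown}")
    return codes


def conditions(licence):
    """List the conditions attached to a licence, one per component."""
    return [COMPONENTS[c] for c in parse_cc(licence)]


def allows_derivatives(licence):
    """Derivative works are allowed unless the ND component is present."""
    return "ND" not in parse_cc(licence)


for lic in ["CC BY", "CC BY-SA", "CC BY-NC-ND"]:
    print(lic, "->", conditions(lic),
          "| derivatives allowed:", allows_derivatives(lic))
```

Reading the printed conditions side by side shows how quickly the restrictions stack up as components are added.
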
The ND part, specifically, would not allow you to make selections of participants or to calculate a BMI from height and weight, as these would be derivatives of the original dataset. This would apply to any licence that includes the ND clause. The SA part is also quite restrictive, as it means that you can create a derivative work (your manuscript) but you can only publish it under the same licence. The NC clause would prevent you from publishing in a pay-walled journal, as this would mean that commercial benefit is derived from the data.

So, if you are looking at a dataset, be sure to look for the licence attached. Look for CC0 or CC-BY. CC0 means that there are no conditions attached. CC-BY means that all you need to do is provide attribution (i.e., a reference) to the source. If you use a CC-BY dataset, please remember to do that! If you are the data owner and are considering which licence you should choose for your data before publishing it, make sure you don’t make the licence too restrictive. Restricting your licence too much will prevent many very relevant use cases.

Creative Commons licences are not the only type of licence that can be used in open research. Some licences specifically designed for data are being created (Open Data Commons - ODC licences) and there are other ‘open’ licences such as the GNU and MIT licences, although these are mostly used for open source software rather than for data.

To summarise, we have seen that open data is only a part of the wider Open Science universe and that open data should ideally be FAIR, but that this is often only partially the case. We must pay attention to the availability and quality of metadata, both when we look for open data and when we are developing our own open data. We have seen that there are many ways of finding open data and that, unfortunately, none of these ways is yet ideal.
Finally, we have discussed that open data generally still has a data owner and that you should be very aware of the conditions that the data licence sets for how you can use it. To conclude, I think we can say that we have started on the right path towards a situation where research data is open, but that we still have a long way to go.