What Data will be collected?
How will the data/samples be collected and analysed?
Finding Data
What Data will be collected?
Researchers need to plan what data they need to collect, and how to collect it.
It is essential to decide what data will be used in the research project and how it will be represented.
Researchers also need to determine the level of sensitivity of the data.
At the same time, researchers need to determine who will have access to the data.
There are Three Types of Data
There are 4 (four) main categories of data: nominal, ordinal, discrete, and continuous. Understanding these data categories helps in choosing appropriate analysis techniques and in making sense of the information encountered.
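As a minimal sketch of how the data category can guide the choice of analysis, the example below picks one summary statistic per category. The mapping and the sample values are illustrative assumptions, not a rule taken from this guide: the mode is used for nominal data, the median for ordinal data, and the mean for discrete and continuous data.

```python
from statistics import mean, median, mode

# Hypothetical mapping from data category to a suitable summary statistic.
SUMMARY_BY_CATEGORY = {
    "nominal": mode,     # unordered labels: only the most frequent value is meaningful
    "ordinal": median,   # ordered ranks: the middle value respects the ordering
    "discrete": mean,    # counts: arithmetic on the values is meaningful
    "continuous": mean,  # measurements: arithmetic on the values is meaningful
}


def summarise(values, category):
    """Return one summary value appropriate for the given data category."""
    return SUMMARY_BY_CATEGORY[category](values)


# Made-up sample values, one list per category.
blood_groups = ["A", "O", "O", "B", "O"]   # nominal
satisfaction = [1, 2, 2, 4, 5]             # ordinal (1 = low, 5 = high)
household_size = [1, 2, 2, 3, 4]           # discrete
height_cm = [160.0, 170.0, 180.0]          # continuous

nominal_summary = summarise(blood_groups, "nominal")
ordinal_summary = summarise(satisfaction, "ordinal")
discrete_summary = summarise(household_size, "discrete")
continuous_summary = summarise(height_cm, "continuous")
```

Other choices are defensible (for example, the median for skewed continuous data); the point is only that the category constrains which operations make sense.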
There are 5 (five) data classification types
Sensitive Data
Sensitive data is data that must be protected against unwanted disclosure. Access to sensitive data should be safeguarded. Protection of sensitive data may be required for legal or ethical reasons, for issues pertaining to personal privacy, or for proprietary considerations.
Examples of sensitive data are:
Interactive course module: De-identification and anonymisation of transcript data. UK Data Service
Interactive course module: De-identification and anonymisation of quantitative data. UK Data Service
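One common de-identification step is pseudonymisation: replacing a direct identifier with a code that cannot easily be traced back to the person. The sketch below illustrates the idea with a keyed hash (HMAC-SHA256) from the Python standard library. The field names, the sample records and the key are hypothetical, and this is not presented as the method taught in the modules above. Pseudonymised data is not anonymous: the remaining fields may still allow re-identification.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash, shortened for readability."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


# Hypothetical key; in practice it would be stored securely and separately from the data.
SECRET_KEY = b"replace-with-a-securely-stored-key"

# Made-up records containing a direct identifier ("name").
records = [
    {"name": "Alice Example", "age": 34},
    {"name": "Bob Example", "age": 51},
    {"name": "Alice Example", "age": 34},
]

# Drop the name and keep a stable pseudonym, so repeated records can still be linked.
deidentified = [
    {"participant_id": pseudonymise(r["name"], SECRET_KEY), "age": r["age"]}
    for r in records
]
```

Because the same name and key always give the same pseudonym, records for one participant stay linkable without exposing the name.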
Research data exist in many different forms: Textual, numerical, databases, geospatial, images, audio-visual recordings and data generated by machines or instruments.
Digital data exists in specific file formats, which are coded so that a software programme can read and interpret these data.
Using standard and interchangeable or open lossless data formats ensures longer-term usability of data.
For long term preservation, digital data is converted to such formats. UK Data Service
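As a small illustration of converting data to an open, interchangeable format, the sketch below writes a few made-up records to CSV and reads them back to confirm that nothing was lost. The column names and values are hypothetical, and CSV is just one example of an open format, chosen here because the Python standard library supports it.

```python
import csv
import io

# Made-up tabular records; all values are kept as text, as they would be in a CSV file.
rows = [
    {"participant_id": "P001", "age": "34", "response": "agree"},
    {"participant_id": "P002", "age": "51", "response": "disagree"},
]

# Write to an in-memory text buffer; a real project would open a file instead.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["participant_id", "age", "response"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV text back and check that the round trip preserved every value.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
round_trip_ok = restored == rows
```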
Non-Digital formats: Most researchers keep handwritten laboratory notebooks, journals and other materials, examples of which may be surveys, paintings, fossils, minerals and tissue. However, non-digital data can be converted to a digital source in a variety of ways. OpenAIRE
Digital Formats: Formats of original items that have been digitised. Other formats include born-digital items.
Digital and non-digital formats: Includes files, spreadsheets, documents, images, videos, audio, notebooks, diaries, sketches, artifacts, paper surveys, etc. Ghent University. University of Pittsburgh. UK Data Service
Structured Data
Structured data is data that has a standardized format for efficient access by software and humans alike. It is typically tabular, with rows and columns that clearly define data attributes, e.g. a database. The data resides in fixed fields within a record or file, which makes its elements addressable for more effective processing and analysis. Because of this, and because of its quantitative nature, computers can process structured data effectively for insights. TechTarget
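To make "fixed fields within a record" concrete, the following sketch stores a few made-up survey records in an in-memory SQLite table and retrieves values by column name. The table name, column names and values are hypothetical; only the standard-library `sqlite3` module is used.

```python
import sqlite3

# An in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each column is a fixed, named field that every record shares.
conn.execute("CREATE TABLE survey (participant_id TEXT, age INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [("P001", 34, "North"), ("P002", 51, "South"), ("P003", 29, "North")],
)

# Because fields are addressable, a query can filter and sort on them directly.
north_ages = [
    row[0]
    for row in conn.execute(
        "SELECT age FROM survey WHERE region = ? ORDER BY age", ("North",)
    )
]
conn.close()
```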
Unstructured Data
Unstructured data is information, in many different forms, that doesn't follow conventional data models, making it difficult to store and manage in a mainstream relational database.
The majority of new data generated today is unstructured, prompting the emergence of new platforms and tools to manage and analyze this data. These tools let organizations more easily use unstructured data for business intelligence (BI) and analytics applications.
Unstructured data has an internal structure but doesn't contain a predetermined data model or schema. It can be textual or nontextual, human-generated or machine-generated.
Text is one of the most common types of unstructured data. Unstructured text is generated and collected in a range of forms, including Word documents, email messages, PowerPoint presentations, survey responses, transcripts of call center interactions, and posts from blogs and social media sites.
Other types of unstructured data include images, audio and video files. Machine data is another category of unstructured data that's growing fast in many organizations. For example, log files from websites, servers, networks and applications -- particularly mobile ones -- yield a trove of activity and performance data. In addition, companies increasingly capture and analyze data from sensors on manufacturing equipment and other devices connected to the internet of things (IoT). TechTarget
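Machine data such as log files often has an internal structure that can be recovered even though no schema is declared. The sketch below assumes a web server log line laid out in the widely used Common Log Format and extracts named fields with a regular expression. The sample line is invented, and real logs may differ from this layout.

```python
import re

# Assumed layout: host, two placeholder fields, [timestamp], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not fit the assumed layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


# Invented example line using a documentation-only IP address.
sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
parsed = parse_log_line(sample)
unparsed = parse_log_line("free text that follows no fixed layout")
```

Lines that do not match return `None`, which reflects the practical difficulty with unstructured data: some of it will not fit whatever structure is assumed.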
Big Data
Big data is a combination of structured, semi-structured and unstructured data that organizations collect, analyze and mine for information and insights. It's used in machine learning projects, predictive modeling and other advanced analytics applications.
Systems that process and store big data have become a common component of data management architectures in organizations. They're combined with tools that support big data analytics uses. Big data is often characterized by the three V's:
How will the data/samples be collected and analysed?
Researchers need to decide:
Primary Data
Definition: Data that has been generated by the researchers themselves through surveys, interviews or experiments specially designed for understanding and solving the research problem at hand. Benedictine University
Primary data is information collected by a researcher to address a specific issue or problem. As it has not yet been gathered or may not be accessible, it is data that is unique, first-hand and from an original data source. The data is collected by a researcher using a variety of techniques, such as interviews, focus groups, surveys and observations. Open University
Data may be primary (generated by the researcher for a particular research purpose or project) or secondary (originally created by someone else for another purpose) in nature. To prepare data for secondary research, researchers should document data appropriately. They should also explain the procedures and fieldwork methods, the objectives and methodology of the research, and explicitly describe the meanings of variables and codes used. Additionally, they should describe any derivation, transformations, de-identification (pseudonymisation/anonymisation) or data cleaning carried out.
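One way to describe the meanings of variables and codes is a codebook kept alongside the data. The sketch below is a minimal, hypothetical codebook held as a Python dictionary, with a helper that decodes stored values. The variable names, labels and codes are invented for illustration.

```python
# Hypothetical codebook: one entry per variable, with a label and the meaning of each code.
CODEBOOK = {
    "sex": {
        "label": "Sex of respondent",
        "codes": {1: "Male", 2: "Female", -9: "Not answered"},
    },
    "tenure": {
        "label": "Housing tenure",
        "codes": {1: "Owner", 2: "Renter", -9: "Not answered"},
    },
}


def decode(variable, value):
    """Translate a stored code into its documented meaning, flagging undocumented codes."""
    return CODEBOOK[variable]["codes"].get(value, f"Undocumented code: {value}")


# A made-up record as it might appear in the data file.
raw_record = {"sex": 2, "tenure": -9}
decoded = {variable: decode(variable, value) for variable, value in raw_record.items()}
```

Flagging undocumented codes, rather than silently passing them through, helps a secondary user spot gaps in the documentation.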
Examples of Research Data:
Finding Data
One of the key issues to think about at the start of Data Management Planning is where to locate or find data.
"Finding the right data for the research is easiest when there is a well-thought-out strategy." Texas Universities
It is important for researchers to decide whether to collect new data or reuse existing data. University of Pittsburgh
Secondary Data
Definition: Existing data generated by large government institutions, healthcare facilities, etc. as part of organizational record keeping. The data is then extracted from more varied data files. Benedictine University
A secondary data source refers to a data source that is already in existence and is being used either for a purpose for which it was not originally intended or by someone other than the researcher who collected the original data (Salkind, 2010).
Locating existing data
Identifying and locating sources of existing data can be important for a variety of reasons, including:
Secondary data can be raw data or published summaries and you can tailor the data according to your research needs. Examples include large databases of surveys, censuses, and social and economic data that are too expensive or unfeasible for an individual to collect.
Other types of secondary data include organisational records and surveys (e.g. employee surveys), market research data, or transcripts of interviews or focus groups. Whether you are collecting data for your own project or compiling a portfolio of evidence, secondary data serves as a time-efficient and easy to obtain source of information. Open University
Data Directories
These online directories maintain lists of data sources and repositories across a wide range of disciplines.
Open Access Directory of Data Repositories - This is a list of repositories and databases for open data for a wide range of subject areas.
General repositories
These repositories maintain data from a wide range of subject areas and are not limited to a particular discipline.
Figshare is a provider of open research repository infrastructure and software. It helps organizations and researchers share, showcase and manage their research outputs in a discoverable, citable, reportable and transparent way. It is a repository for sharing all types of research output in any subject, including papers, figures, posters and slides.
Figshare supports organizations and researchers in meeting the growing demands for research to become open, freer, FAIRer and more connected, and provides the flexibility and control to create research management workflows.
Amazon Web Services Public Data Sets - This registry exists to help people discover and share datasets that are available via AWS resources. It hosts a variety of large public datasets, such as Landsat, census, and genomic data. Creating an account may be required and charges may apply for computing time and data transfer.
See all usage examples for datasets listed in this registry.
See datasets from Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Data for Good at Meta, NASA Space Act Agreement, NIH STRIDES, NOAA Open Data Dissemination Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Discipline related repositories
The following are examples of data repositories that focus on a particular subject area, discipline, or cluster of related disciplines within the broad categories of humanities, sciences, social sciences, and government.
Linguistics
OLAC – Open Language Archives Community is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. It currently has 58 participating archives.
TROLLing - Tromsø Repository of Language and Linguistics - An open access repository of linguistic data and statistical code.
Music
Mutopia Project - Free sheet music. Pieces of music – free to download, modify, print, copy, distribute, perform, and record – all in the Public Domain or under Creative Commons licenses, in PDF, MIDI, and editable LilyPond file formats
Biology/Life Sciences
DRYAD - General purpose repository for data underlying scientific and medical publications, historically with a concentration in life sciences.
Gene Expression Atlas - Information on gene expression patterns under different biological conditions, such as different cell types, organism parts, or diseases.
genenames.org (HUGO Gene Nomenclature Committee) - Curated repository of HGNC approved gene names and symbols, gene families, and links to related genomic, proteomic, and phenotypic information.
NCBI (National Center for Biotechnology Information) - Advances science and health by providing access to biomedical and genomic information from a variety of sources.
UniProt (The Universal Protein Resource) - Collection of databases that provide a comprehensive source for protein sequence and annotation data, including a repository for metagenomics and environmental data.
Chemistry
eCrystals - Mostly open access archive from the University of Southampton and the EPSRC UK National Crystallography Service. Each entry contains all the fundamental and derived data resulting from a single crystal X-ray structure determination, excluding the raw images.
PubChem - Database of chemical substances with descriptive and property information along with bioactivity screening data. PubChem mostly contains small molecules, but also larger molecules such as nucleotides, carbohydrates, lipids, peptides, and chemically-modified macromolecules. It collects information on chemical structures, identifiers, chemical and physical properties, biological activities, patents, health, safety, toxicity data, and many others.
Zinc15 - Free database of commercially available compounds with 3-D structure representations in a format ready for virtual screening for potential biological activity. ZINC contains over 230 million purchasable compounds in ready-to-dock, 3D formats, and over 750 million purchasable compounds that can be searched for analogs in under a minute.
Economics
GTAP Database – Global Trade Analysis Project - The centerpiece of the Global Trade Analysis Project is a global database describing bilateral trade patterns, production, consumption and intermediate use of commodities and services.
GeoFRED® - Geographical Economic Data - Maps of data contained in FRED®. Create customized maps and download data.
Traditional Journals that Publish Data
These traditional "data journals" publish only articles that focus on presenting data, either experimental or computational, or may review experimental methods.
Journal of Physical and Chemical Reference Data - Publishes articles reporting critically evaluated reference data and property measurements.
Journal of Chemical and Engineering Data - Publishes both experimental and computational data.
Data Journals or "Data Paper" Journals
These newer style "data journals" primarily publish articles that describe publicly available datasets and link to those datasets. They may also publish articles on data-related topics, such as describing or reviewing certain analytical or statistical methods. However, traditional research articles that actually analyze the data and draw conclusions from that analysis are generally outside the scope of these journals.
Biodiversity Data Journal - Community peer-reviewed and open-access. Promotes the publishing, dissemination and sharing of biodiversity-related data of any kind. Publishes data papers, general articles, software descriptions, species inventories, and more.
Earth System Science Data - An international interdisciplinary journal that provides a distinctive model for publishing papers about original research data sets and encouraging the reuse of high quality data. Includes methods and review articles and a "living data" process for handling datasets that undergo regular updating or extension.
IUCrData - Open-access and peer-reviewed. Provides descriptions of crystallographic datasets and datasets from related disciplines.
Scientific Data - Open-access and peer-reviewed. Its Data Descriptor articles describe data sets, the method of data collection and analyses relating to the quality of the data. They also link to one or more published sources of the data.
Mixed Journals
These journals publish a mixture of article types, including "data papers" that describe datasets along with traditional research articles and other formats.
International Journal of Robotics Research - Publishes peer-reviewed data papers and multimedia extensions in addition to articles.
Internet Archaeology - Open access and peer-reviewed. Publishes data papers as well as research articles, methodologies, reviews and more.
Nucleic Acids Research - For more than 20 years, it has published a special issue in January that reports on databases containing data related to bioinformatics generally, including nucleic acids, proteins, and genomics.
These are only a few examples of journals that can point you to useful data. For more complete listings, check these sites:
Sources of Dataset Peer Review (from the Edinburgh DataShare Wiki)
A Growing List of Data Journals (from Data@MLibrary)
Open Data Journals (from the FOSTER project)
Survey Data
Survey data, including data from long-running surveys, series and longitudinal studies, are a major part of social science research. This section provides guidance and training resources around using survey and longitudinal data including short videos, on-demand webinars, event materials and more detailed written guides.
Stata is a statistical software package for data analysis. You can use Stata by pointing and clicking, or by using the command syntax. The software can support complex analysis, and, as it is so programmable, developers and users continue to add new features.
What is SPSS 20 for Windows? (PDF). The IBM® SPSS® software platform offers:
Its ease of use, flexibility and scalability make SPSS accessible to users of all skill levels. What’s more, it’s suitable for projects of all sizes and levels of complexity, and can help you find new opportunities, improve efficiency and minimize risk. SPSS
SPSS is a software package for Windows that can be used to produce graphics of data as well as other data analysis.
Nesstar enables you to search, browse, visualise, analyse and download a selected range of different kinds of social and economic data, from survey data to multidimensional tables.
CLOSER (Cohort & Longitudinal Studies Enhancement Resources) aims to maximise the use, value, and impact of longitudinal studies. Part of their work includes training and capacity building for researchers. They have a Learning Hub with information and resources on longitudinal data, and they run training events. CLOSER is an interdisciplinary partnership of leading social and biomedical longitudinal population studies.
P|E|A|S (Practical Exemplars and Survey Analysis) provides useful information and examples for researchers analysing data from complex samples. It also includes sections on survey non-response.
NCRM provide methodological training and resources to help people interested in social science research methods.
A short video tutorial, "Finding and using survey documentation", is useful for surveys available via the UK Data Service.
Hierarchical datasets contain information about more than one unit; for example, we can have data at the individual and household level, which are often contained in separate data files.
For examples and practical advice on working with hierarchical datasets, see our guide What are hierarchical Files? (PDF).
You can also find practical instructions for linking files in our guides to Stata and SPSS, available under survey software and tools.
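The idea of linking hierarchical files can be sketched in a few lines. The example below joins a made-up individual-level file to a made-up household-level file on a shared household identifier. The field names and values are hypothetical, and this plain-Python version only illustrates the logic; it does not reproduce the Stata or SPSS procedures in the guides mentioned above.

```python
# Household-level file: one record per household.
households = [
    {"household_id": "H1", "region": "North", "tenure": "Owner"},
    {"household_id": "H2", "region": "South", "tenure": "Renter"},
]

# Individual-level file: one record per person, carrying the household identifier.
individuals = [
    {"person_id": "P1", "household_id": "H1", "age": 34},
    {"person_id": "P2", "household_id": "H1", "age": 36},
    {"person_id": "P3", "household_id": "H2", "age": 51},
]

# Index households by their identifier so each lookup is direct.
households_by_id = {h["household_id"]: h for h in households}

# Attach the household attributes to every individual (a many-to-one link).
linked = [
    {**person, **households_by_id[person["household_id"]]}
    for person in individuals
]
```

Each person keeps their own attributes and gains those of their household, so household-level variables such as region can be used in individual-level analysis.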
The guide to Analysing change over time (PDF) introduces data and methods for studying change over time quantitatively.
Course Module: Introductory module: Survey Data
Longitudinal Data
A longitudinal study is research conducted over an extended period of time. It is mostly used in medical research and other areas like psychology or sociology.
A longitudinal survey can pay off with actionable insights when you have the time to engage in a long-term research project.
Longitudinal studies often use surveys to collect data that is either qualitative or quantitative. Additionally, in a longitudinal study, a survey creator does not interfere with survey participants. Instead, the survey creator distributes questionnaires over time to observe changes in participants, behaviors, or attitudes.
Many medical studies are longitudinal; researchers note and collect data from the same subjects over what can be many years. QuestionPro
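Observing change in the same subjects over time can be illustrated as follows. The sketch assumes a made-up dataset in which each participant answers the same question in several survey waves, and computes the change between each participant's first and last wave. The identifiers, years and scores are invented.

```python
from collections import defaultdict

# Made-up observations: (participant, survey year, score on the same question).
observations = [
    ("P1", 2019, 3), ("P1", 2021, 4), ("P1", 2023, 5),
    ("P2", 2019, 4), ("P2", 2021, 4), ("P2", 2023, 2),
]

# Group the waves belonging to each participant.
by_participant = defaultdict(list)
for participant_id, year, score in observations:
    by_participant[participant_id].append((year, score))

# Change from each participant's earliest wave to their latest wave.
change = {}
for participant_id, waves in by_participant.items():
    waves.sort()  # order by year
    first_score = waves[0][1]
    last_score = waves[-1][1]
    change[participant_id] = last_score - first_score
```

Comparing first and last waves is only the simplest possible measure; it ignores what happens in between.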
Course Module: Introductory module: Longitudinal Data. UK Data Service
Primary Data vs Secondary Data
BASIS FOR COMPARISON | PRIMARY DATA | SECONDARY DATA |
---|---|---|
Meaning | Primary data refers to first-hand data gathered by the researchers themselves. | Secondary data means data collected by someone else earlier. |
Data | Real time data | Past data |
Process | Very involved | Quick and easy |
Source | Surveys, observations, experiments, questionnaire, personal interview, etc. | Government publications, websites, books, journal articles, internal records etc. |
Cost effectiveness | Expensive | Economical |
Collection time | Long | Short |
Specific | Always specific to the researcher's needs. | May or may not be specific to the researcher's need. |
Available in | Crude form | Refined form |
Accuracy and Reliability | More | Relatively less |
Situational data – image or video that exists already but, when used in research, becomes situational data for that researcher. Situational data can also be created by researchers for one purpose and used by another set of researchers at a later date for a completely different research agenda.
Simulation data is data generated from test models, where the model and metadata may be more important than the output data from the model, e.g. economic or climate models, or computational social science. UK Data Service
Raw or processed nature
Derived or compiled data results from processing or combining 'raw' data. It is often reproducible but expensive, e.g. compiled databases, text mining, aggregate census data. UK Data Service
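As a minimal sketch of deriving data from raw records, the example below aggregates made-up person-level records into counts and an employment rate per region, in the spirit of aggregate census data. The field names and values are hypothetical.

```python
from collections import Counter

# Made-up raw, person-level records.
raw_records = [
    {"person_id": "P1", "region": "North", "employed": True},
    {"person_id": "P2", "region": "North", "employed": False},
    {"person_id": "P3", "region": "South", "employed": True},
    {"person_id": "P4", "region": "South", "employed": True},
]

# Derived data: counts per region, then a rate computed from those counts.
population_by_region = Counter(r["region"] for r in raw_records)
employed_by_region = Counter(r["region"] for r in raw_records if r["employed"])
employment_rate = {
    region: employed_by_region[region] / total
    for region, total in population_by_region.items()
}
```

The derived table can be regenerated from the raw records at any time, which is why documenting the derivation steps matters as much as keeping the output.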