20.21 Analysis templates

This page summarizes standards for data analysis, including conventions related to the repository, the directory structure, and the data.

Repository
Directory structure
Data
Docker
Useful links and textbooks

Repository

Data should be versioned with Git. Repositories can be hosted and shared on GitHub.
Visibility may initially be set to private. It is essential not to publish confidential or copyright-protected data (for example, primary data from surveys, or PDF documents whose copyright is owned by a publisher).
Once the work is published, the team can decide whether to switch the repository to public visibility.
Appropriate linters should be activated.
For examples, see repo-example or deep-cenic.

Directory structure

├── README.md          <- The top-level README summarizing the project.
├── CITATION.cff       <- How to cite the work.
├── Makefile           <- Makefile with commands like `make data` or `make train`.
├── Dockerfile         <- Docker image to standardize the computational environment.
├── requirements.txt   <- The requirements file for reproducing the analysis environment.
├── LICENSE            <- A text file containing the license.
├── .pre-commit-config.yaml
│                      <- The configuration for pre-commit hooks.
├── .gitignore         <- Excludes files from versioning.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── src                <- Source code for use in this project.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting.
│
└── docs               <- Data dictionaries, manuals, and all other explanatory materials.
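
The layout above can be created by hand, but a small script makes it repeatable. The following is a minimal sketch, not an official template tool: the function name `scaffold` is hypothetical, the files are created empty, and the `.gitkeep` placeholders are only a common convention (assumed here) for keeping otherwise empty directories under version control.

```python
from pathlib import Path

# Directories taken from the layout above.
DIRECTORIES = [
    "data/external",
    "data/interim",
    "data/processed",
    "data/raw",
    "src",
    "reports/figures",
    "docs",
]

# Top-level files taken from the layout above; they are created empty.
FILES = [
    "README.md",
    "CITATION.cff",
    "Makefile",
    "Dockerfile",
    "requirements.txt",
    "LICENSE",
    ".pre-commit-config.yaml",
    ".gitignore",
]


def scaffold(root):
    """Create the project layout under `root` without overwriting existing files.

    Returns the list of paths that were newly created.
    """
    root = Path(root)
    created = []
    for directory in DIRECTORIES:
        path = root / directory
        path.mkdir(parents=True, exist_ok=True)
        # Assumption: a .gitkeep placeholder so Git tracks the empty directory.
        keep = path / ".gitkeep"
        if not keep.exists():
            keep.touch()
            created.append(keep)
    for name in FILES:
        path = root / name
        if not path.exists():
            path.touch()
            created.append(path)
    return created
```

Calling `scaffold("my-analysis")` (a hypothetical project name) creates the tree; running it a second time changes nothing and returns an empty list, so it is safe to run in an existing repository.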

Data

Tabular datasets should follow a tidy data structure.
Preferred file formats are plain-text formats such as md, txt, csv, py, ipynb, and r.
Binary formats, such as docx, pptx, or xlsx, should be avoided because Git cannot show meaningful diffs for them.
Resources on research data management may be helpful.
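
To illustrate the tidy data convention mentioned above (each variable is a column, each observation is a row), here is a minimal sketch using only the Python standard library. The helper `to_tidy`, the column names, and the sample values are hypothetical and serve only as an illustration; in practice a library such as pandas may be more convenient.

```python
import csv
import io


def to_tidy(rows, id_field, var_name, value_name):
    """Reshape wide rows (one column per variable level) into tidy long rows."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            # One output row per (identifier, former column name) pair.
            tidy.append({id_field: row[id_field], var_name: column, value_name: value})
    return tidy


# Hypothetical wide table: one row per country, one column per year.
wide_csv = "country,2019,2020\nA,10,12\nB,7,9\n"
wide_rows = list(csv.DictReader(io.StringIO(wide_csv)))

tidy_rows = to_tidy(wide_rows, id_field="country", var_name="year", value_name="count")

# Write the tidy table back out as plain-text CSV.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["country", "year", "count"], lineterminator="\n")
writer.writeheader()
writer.writerows(tidy_rows)
tidy_csv = out.getvalue()
```

In the tidy version, `year` becomes a column of its own instead of being encoded in the header, so each row describes exactly one country-year observation.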

Docker

TODO: summarize the use of Docker containers (see the analysis directory).

Useful links and textbooks

Cookiecutter data-science projects
List of tools for labeling tasks
Blog entry: avoid using Docker:latest
Danchev, V. (2021). Reproducible Data Science with Python.
R: be careful with setwd()
R in GitHub codespaces
Datasette.io