20.21 Analysis templates
This page summarizes standards for data analysis, including conventions related to the repository, the directory structure, and the data.
Repository
- Data should be versioned with Git. Repositories can be hosted and shared on GitHub.
- Visibility may initially be set to private. It is essential not to publish confidential or copyright-protected data (this may include primary data from surveys, or PDF documents whose copyright is owned by publishers).
- Once the work is published, the team can decide whether to switch the repository to public visibility.
- Appropriate linters should be activated.
- See the repo-example or the deep-cenic example.
Directory structure
├── README.md <- The top-level README summarizing the project.
├── CITATION.cff <- How to cite the work.
├── Makefile <- Makefile with commands like `make data` or `make train`.
├── Dockerfile <- Docker image to standardize the computational environment.
├── requirements.txt <- The requirements file for reproducing the analysis environment.
├── LICENSE <- A text file containing the license.
├── .pre-commit-config.yaml <- The configuration for pre-commit hooks
├── .gitignore <- Excludes files from versioning
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── src <- Source code for use in this project.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
└── docs <- Data dictionaries, manuals, and all other explanatory materials.
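The layout above can be set up by hand or with a short script. The following is a minimal sketch, not an official template: the function name `create_skeleton` and the use of `.gitkeep` placeholder files (so that Git tracks otherwise empty directories) are assumptions made for illustration.

```python
from pathlib import Path

# Directories taken from the structure shown above.
DIRECTORIES = [
    "data/external",
    "data/interim",
    "data/processed",
    "data/raw",
    "src",
    "reports/figures",
    "docs",
]

# Top-level files from the structure shown above, created as empty placeholders.
FILES = [
    "README.md",
    "CITATION.cff",
    "Makefile",
    "Dockerfile",
    "requirements.txt",
    "LICENSE",
    ".pre-commit-config.yaml",
    ".gitignore",
]


def create_skeleton(root):
    """Create the directory tree and empty placeholder files under root."""
    root = Path(root)
    for directory in DIRECTORIES:
        path = root / directory
        path.mkdir(parents=True, exist_ok=True)
        # Assumption: .gitkeep is only a convention to keep empty directories in Git.
        (path / ".gitkeep").touch()
    for name in FILES:
        # touch() creates the file if missing and leaves existing content untouched.
        (root / name).touch()
    return root
```

Calling `create_skeleton("my-analysis")` (a hypothetical project name) creates the tree in a new directory. Running it again on an existing project does not overwrite file contents.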
Data
- Tabular datasets should follow a tidy data structure.
- Preferred file formats include text files like md, txt, csv, py, ipynb, r.
- Binary formats, such as docx, pptx, or xlsx, should be avoided.
- Resources on research data management may be helpful.
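To illustrate the tidy data convention, the sketch below reshapes a "wide" table (one column per year) into a tidy one (one observation per row, one variable per column). The helper `to_tidy`, the column names, and the example values are hypothetical and only serve as an illustration. In practice, a library function such as `pandas.melt` does the same job.

```python
def to_tidy(rows, id_columns, variable_name, value_name):
    """Reshape wide rows (a list of dicts) into one observation per row."""
    tidy = []
    for row in rows:
        # Identifier columns are repeated on every output row.
        ids = {column: row[column] for column in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue
            # Each remaining column becomes one (variable, value) observation.
            tidy.append({**ids, variable_name: column, value_name: value})
    return tidy


# Hypothetical example data in wide format: one column per year.
wide = [
    {"country": "A", "2020": 10, "2021": 12},
    {"country": "B", "2020": 7, "2021": 9},
]

tidy = to_tidy(wide, ["country"], "year", "count")
```

Here `tidy` contains four rows, each with the keys `country`, `year`, and `count`, which can be written to a csv file with `csv.DictWriter`.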
Docker
TODO: summarize the use of Docker containers (see analysis directory).
Useful links and textbooks
- Cookiecutter data-science projects
- List of tools for labeling tasks
- Blog entry: avoid using Docker:latest
- Danchev, V. (2021). Reproducible Data Science with Python. link
- R: be careful with setwd()
- R in GitHub codespaces
- Datasette.io