Quickstart
How to get started with the dod2k environment, functions, notebooks and products.
For database use (DoD2k)
-

1. Get the project: in a working directory,
   ```
   git clone https://github.com/lluecke/dod2k.git
   ```
2. Create and activate the python environment: in `dod2k/`,
   ```
   conda env create -n dod2k-env -f dod2k-env.yml
   conda activate dod2k-env
   ```
3. Explore DoD2k: use the notebooks
   ```
   notebooks/df_info.ipynb
   notebooks/df_plot_dod2k.ipynb
   notebooks/df_filter.ipynb
   ```
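If you would rather inspect a compact dataframe outside the notebooks, plain pandas is enough. A minimal sketch using an inline stand-in CSV (the column names here are illustrative, not the actual DoD2k schema):

```python
import io

import pandas as pd

# Stand-in for a compact dataframe exported as CSV; the real files
# live in the data/ directories (column names here are made up).
csv_text = """archiveType,lat,lon,proxy
speleothem,37.5,14.2,d18O
tree,46.1,8.9,ring width
"""

df = pd.read_csv(io.StringIO(csv_text))
df.info()
print(df["archiveType"].value_counts())
```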
Applications of DoD2k
-

1. For analysis of moisture-, temperature-, or moisture-and-temperature-sensitive records, use
   ```
   notebooks/analysis_M.ipynb
   notebooks/analysis_MT.ipynb
   notebooks/analysis_T.ipynb
   ```
2. For speleothem analysis: to run
   ```
   notebooks/S_analysis_v1.6.ipynb
   ```
   you will first need to create the directory `data/speleothem_modeling_inputs` and download into it the data from their source URLs:
   ```
   mkdir speleothem_modeling_inputs
   cd speleothem_modeling_inputs
   wget https://wateriso.utah.edu/waterisotopes/media/ArcGrids/GlobalPrecip.zip
   unzip GlobalPrecip.zip
   wget https://crudata.uea.ac.uk/cru/data/hrg/cru_ts_4.07/cruts.2304141047.v4.07/tmp/cru_ts4.07.1901.2022.tmp.dat.nc.gz
   gunzip cru_ts4.07.1901.2022.tmp.dat.nc.gz
   ```
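Before launching the speleothem notebook, it can help to check that the downloads above actually landed where the notebook expects them. A small stdlib sketch (the helper is hypothetical; the required list covers only the unpacked CRU file, since the contents of `GlobalPrecip.zip` are not listed here):

```python
from pathlib import Path

# Hypothetical helper: verify the modeling inputs described above are
# present before running S_analysis_v1.6.ipynb.
REQUIRED = [
    "cru_ts4.07.1901.2022.tmp.dat.nc",  # CRU TS temperature grid (after gunzip)
]

def missing_inputs(data_dir):
    """Return the required input files that are not yet in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED if not (data_dir / name).is_file()]

print(missing_inputs("data/speleothem_modeling_inputs"))
```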
For toolkit use (DT2k)
-

1. Get the project: in a working directory,
   ```
   git clone https://github.com/lluecke/dod2k.git
   ```
2. Create and activate the python environment: in `dod2k/`,
   ```
   conda env create -n dod2k-env -f dod2k-env.yml
   conda activate dod2k-env
   ```
Create a common dataframe from source databases (OPTIONAL)
-

1. Load scripts for the input databases:
   ```
   data/pages2k/load_pages2k.ipynb
   data/fe23/load_fe23.ipynb
   data/iso2k/load_iso2k.ipynb
   data/sisal/load_sisal.ipynb
   data/ch2k/load_ch2k.ipynb
   ```
2. Merge the databases:
   ```
   data/dod2k/merge_databases.ipynb
   ```
   Note: these notebooks create compact dataframes from the source data and merge all the databases into one common dataframe. If you are not interested in this step, you can skip it and use the compact dataframes provided in the directories (as `csv` or `pkl` files). To alter the source data (e.g. to update a database or add a new one), add/edit these notebooks accordingly.
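Conceptually, the merge step stacks the per-database compact dataframes into one common dataframe. A hedged pandas sketch of that idea (the toy data and column names are illustrative assumptions, not the notebook's actual schema):

```python
import pandas as pd

# Toy stand-ins for two per-database compact dataframes.
pages2k = pd.DataFrame({"site": ["A"], "lat": [10.0]})
iso2k = pd.DataFrame({"site": ["B"], "lat": [-5.0]})

# Stack them into one common dataframe, tagging each record with the
# source database it came from.
frames = {"pages2k": pages2k, "iso2k": iso2k}
dod2k = pd.concat(
    [df.assign(source_db=name) for name, df in frames.items()],
    ignore_index=True,
)
print(dod2k)
```

Keeping a `source_db` column (or something like it) is what lets later steps trace a merged record back to its origin.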
Run duplicate workflow
-

The following steps recreate the complete duplicate workflow.

1. Duplicate detection: if you have altered any source data, run
   ```
   notebooks/dup_detection.ipynb
   ```
   This notebook goes through each pair of records to identify potential duplicate candidates. Careful: this is computationally heavy and may take some time to run! The notebook outputs the file
   ```
   root/data/dod2k/dup_detection/dup_detection_candidates_dod2k.csv
   ```
   which will be used in the decision process (next step). If you have not changed any source data, you may skip this step and proceed to the next one.
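The pairwise screening idea can be sketched as follows; the field names, proximity test, and threshold are illustrative assumptions, not the notebook's actual criteria:

```python
import itertools
import math

# Toy records; in the real workflow these come from the merged dataframe.
records = [
    {"id": "rec1", "lat": 10.00, "lon": 20.00},
    {"id": "rec2", "lat": 10.01, "lon": 20.01},  # near rec1
    {"id": "rec3", "lat": -40.0, "lon": 150.0},
]

def close(a, b, tol_deg=0.1):
    """Crude proximity check in degrees (illustrative only)."""
    return math.hypot(a["lat"] - b["lat"], a["lon"] - b["lon"]) <= tol_deg

# Compare every pair of records and keep the ones that screen as close.
candidates = [
    (a["id"], b["id"])
    for a, b in itertools.combinations(records, 2)
    if close(a, b)
]
print(candidates)  # [('rec1', 'rec2')]
```

Because every pair is compared, the cost grows quadratically with the number of records, which is why the real notebook can take a long time to run.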
2. Duplicate decision process: run
   ```
   notebooks/dup_decision.ipynb
   ```
   This notebook walks you through all the potential duplicate candidates and asks for decisions on certain candidate pairs. The decisions are saved in
   ```
   root/data/dod2k/dup_detection/dup_decisions_dod2k_{INITIALS}_{DATECREATED}.csv
   ```
   Note: the decision process may be lengthy and can be interrupted by server issues. A backup file is created during the workflow, so it should be possible to restart where you left off; however, this only works if your initials and the date match the backup file. If you restart on another day, you must alter the date in the backup file's name accordingly. The backup file can be found at
   ```
   root/data/dod2k/dup_detection/dup_decisions_dod2k_{INITIALS}_{DATECREATED}_BACKUP.csv
   ```
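To see why the initials and date must match, here is a hedged sketch of how such file names are keyed; the helper and the exact date format are assumptions for illustration:

```python
from datetime import date

# Hypothetical sketch: the decision and backup files are looked up by a
# name built from INITIALS and DATECREATED, so if either differs from
# the backup file's name, the workflow cannot find the backup to resume.
def decisions_filename(initials, created, backup=False):
    suffix = "_BACKUP" if backup else ""
    return f"dup_decisions_dod2k_{initials}_{created}{suffix}.csv"

today = date(2024, 5, 2).isoformat()
print(decisions_filename("LL", today))
print(decisions_filename("LL", today, backup=True))
```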
3. Duplicate removal process: run
   ```
   notebooks/dup_removal.ipynb
   ```
   to implement all the decisions and create a duplicate-free compact dataframe.
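The removal step can be sketched as filtering out every record flagged for removal in the decisions table; the column names here are illustrative assumptions, not the repo's schema:

```python
import pandas as pd

# Toy merged dataframe and a toy decisions table.
df = pd.DataFrame({"id": ["rec1", "rec2", "rec3"], "value": [1, 2, 3]})
decisions = pd.DataFrame({
    "keep_id": ["rec1"],
    "remove_id": ["rec2"],  # rec2 was judged a duplicate of rec1
})

# Drop every record listed for removal; the survivors form the
# duplicate-free compact dataframe.
dupfree = df[~df["id"].isin(decisions["remove_id"])].reset_index(drop=True)
print(dupfree["id"].tolist())  # ['rec1', 'rec3']
```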
4. Rerun the duplicate workflow (to check for remaining duplicates) on `dod2k_dupfree`. This creates `dod2k_dupfree_dupfree` (which is published as DoD2k).
Explore output (see "Explore DoD2k" under database use)
-

If you want to explore your own output, you will need to alter the `filename` for loading according to your initials and the date of the file created:
```
db_name = 'dod2k_dupfree_dupfree'
path = 'data/dod2k/'
filename = 'dod2k_dupfree_{INITIALS}_{DATECREATED}_dupfree'

# load dataframe
df = utf.load_compact_dataframe_from_csv(db_name, readfrom=(path, filename))
df.info()
df.name = db_name
```