Loading the DoD2k Database
This tutorial shows you how to load and explore the DoD2k paleoclimate database.
What is DoD2k?
DoD2k (Database of Databases 2k) integrates five major paleoclimate databases:
| Database | Version | Data | Reference | Records | Archives |
|---|---|---|---|---|---|
| PAGES 2k | v2.2.0 | LiPDverse | PAGES 2k Consortium 2017 | 1364 | multi-proxy |
| SISAL | v3 | ORA | Kaushal et al. 2024 | 546 | speleothems |
| Iso2k | v1.1.2 | LiPDverse | Konecky et al. 2020 | 435 | multi-proxy |
| CoralHydro2k | v1.0.1 | LiPDverse | Walter et al. 2023 | 221 | corals |
| FE23 | - | NCEI | Evans et al. 2022 | 2754 | tree-rings |
Because these databases share a number of records, the combined set was subjected to a duplicate detection and removal process. The resulting output is DoD2k.
The database
The database is saved in root_dir/data/dod2k and provided in two data formats:
- pickle format (fast, Python only): dod2k_compact.pkl
- comma-separated value format (portable, readable): dod2k_compact_year.csv, dod2k_compact_metadata.csv, dod2k_compact_paleoData_values.csv
In this directory we also provide a text file, dod2k_compact_README.txt.
Which Format Should I Use?
- CSV: Best for interoperability and inspecting data manually
- Pickle: Faster loading, preserves numpy arrays without conversion (Python only)
Here we explain how to load these files into a pandas DataFrame using Python.
Loading the database from CSV
Step 1: Set up your environment
Start in the repository root directory (dod2k/). From here, add the repository root to the Python path and import the loader:
import sys
from pathlib import Path
# add the repository root (dod2k/) to the path so dod2k_utilities can be imported
dod2k = Path().resolve()
sys.path.insert(0, str(dod2k))
print(dod2k)
from dod2k_utilities.ut_functions import load_compact_dataframe_from_csv
load_compact_dataframe_from_csv reads the individual CSV files and stitches them together into a single DataFrame. See the docstring of load_compact_dataframe_from_csv() for details.
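To build intuition for what the loader does, here is a minimal sketch. It assumes the metadata CSV holds one record per row while the year and paleoData_values CSVs hold one record per column; the actual implementation in ut_functions may differ.
import pandas as pd
from pathlib import Path

def load_compact_sketch(name, data_dir='data'):
    # hypothetical re-implementation, for illustration only
    base = Path(data_dir) / name
    meta = pd.read_csv(base / f'{name}_compact_metadata.csv')
    years = pd.read_csv(base / f'{name}_compact_year.csv')
    values = pd.read_csv(base / f'{name}_compact_paleoData_values.csv')
    # re-attach each record's time axis and proxy values as numpy arrays
    meta['year'] = [years[col].dropna().to_numpy() for col in years.columns]
    meta['paleoData_values'] = [values[col].dropna().to_numpy() for col in values.columns]
    return meta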
Step 2: Load the data from csv
# Load the duplicate-free database
df = load_compact_dataframe_from_csv('dod2k')
# Check the shape
print(f"Database contains {len(df)} records")
print(f"Columns: {', '.join(df.columns)}")
Alternatively: Load the database from the pickle
For faster loading, and if you only need Python access, use the pickle format.
Make sure you start in the repository root directory (dod2k/). From here:
import pandas as pd
# Load the duplicate-free database
df = pd.read_pickle('data/dod2k/dod2k_compact.pkl')
print(f"Database contains {len(df)} records")
Explore the dataframe and visualise the data
Step 3: Explore the dataframe column by column
import pandas as pd
import numpy as np
Under dod2k/notebooks you can find the notebook df_info.ipynb. This notebook goes through the dataframe column by column and shows you a quick summary of the entries.
The key features of this notebook are:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4781 entries, 0 to 4780
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 archiveType 4781 non-null object
1 dataSetName 4781 non-null object
2 datasetId 4781 non-null object
3 duplicateDetails 4781 non-null object
4 geo_meanElev 4699 non-null float32
5 geo_meanLat 4781 non-null float32
6 geo_meanLon 4781 non-null float32
7 geo_siteName 4781 non-null object
8 interpretation_direction 4781 non-null object
9 interpretation_seasonality 4781 non-null object
10 interpretation_variable 4781 non-null object
11 interpretation_variableDetail 4781 non-null object
12 originalDataURL 4781 non-null object
13 originalDatabase 4781 non-null object
14 paleoData_notes 4781 non-null object
15 paleoData_proxy 4781 non-null object
16 paleoData_sensorSpecies 4781 non-null object
17 paleoData_units 4781 non-null object
18 paleoData_values 4781 non-null object
19 paleoData_variableName 4781 non-null object
20 year 4781 non-null object
21 yearUnits 4781 non-null object
dtypes: float32(3), object(19)
memory usage: 765.8+ KB
None
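Most columns hold plain strings, while year and paleoData_values are object columns because each entry is a whole array (the record's time axis and measurements). You can check this directly:
print(type(df['year'].iloc[0]))
print(df['paleoData_values'].iloc[0][:5])  # first five values of the first record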
# archiveType
key = 'archiveType'
print('%s: '%key)
print(np.unique(df[key]))
print(np.unique([str(type(dd)) for dd in df[key]]))
archiveType:
['bivalve' 'borehole' 'coral' 'documents' 'glacier ice' 'ground ice'
'hybrid' 'lake sediment' 'marine sediment' 'mollusk shells'
'sclerosponge' 'speleothem' 'terrestrial sediment' 'tree']
["<class 'str'>"]
# paleoData_proxy
key = 'paleoData_proxy'
print('%s: '%key)
print(np.unique([kk for kk in df[key]]))
print(np.unique([str(type(dd)) for dd in df[key]]))
paleoData_proxy:
['BSi' 'Documentary' 'MXD' 'Mg/Ca' 'Sr/Ca' 'TEX86' 'TRW' 'alkenone'
'borehole' 'calcification' 'calcification rate' 'chironomid'
'chrysophyte' 'd13C' 'd18O' 'd2H' 'diatom' 'dynocist MAT' 'foram Mg/Ca'
'foram d18O' 'foraminifera' 'growth rate' 'historic' 'hybrid' 'melt'
'midge' 'pollen' 'reflectance' 'sed accumulation' 'varve property'
'varve thickness']
["<class 'str'>"]
For further guidance see the interactive notebook.
Step 4: Visualise the dataframe
Under dod2k/notebooks you can find the notebook df_plot_dod2k.ipynb. This notebook visualises the dataframe and produces summary figures of the database. It also reproduces the manuscript figures.
Import the Python libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from matplotlib.gridspec import GridSpec as GS
from copy import deepcopy as dc
from dod2k_utilities import ut_functions as utf # contains utility functions
from dod2k_utilities import ut_plot as uplt # contains plotting functions
After loading the dataframe, start off by counting the number of records in each archive type:
# count the number of records per archive type
archive_count = {}
for at in set(df['archiveType']):
    archive_count[at] = df.loc[df['archiveType']==at, 'archiveType'].count()
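Equivalently, pandas computes the same counts in one line:
# identical counts via pandas
archive_count = df['archiveType'].value_counts().to_dict()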
Now count the number of records for each proxy type, depending on the archive type:
archive_proxy_count = {}
archive_proxy_ticks = []
for at in set(df['archiveType']):
    proxy_types = df['paleoData_proxy'][df['archiveType']==at].unique()
    for pt in proxy_types:
        cc = df['paleoData_proxy'][(df['paleoData_proxy']==pt)&(df['archiveType']==at)].count()
        archive_proxy_count['%s: %s'%(at, pt)] = cc
        archive_proxy_ticks += [at+': '+pt]
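The nested loop is equivalent to a single pandas groupby, which is handy for quick checks:
# same counts per (archive, proxy) pair
print(df.groupby(['archiveType', 'paleoData_proxy']).size())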
Next, assign each archive type a colour, distinguishing between major archives (more than ten records) and minor archives (ten records or fewer), which all share the 'other' colour:
cols = plt.get_cmap('tab20').colors  # colour palette (stand-in; any list with enough colours works)
archive_colour = {'other': cols[-1]}
other_archives = []
major_archives = []
sort = np.argsort([cc for cc in archive_count.values()])
archives_sorted = np.array([cc for cc in archive_count.keys()])[sort][::-1]
for ii, at in enumerate(archives_sorted):
    print(ii, at, archive_count[at])
    if archive_count[at] > 10:
        major_archives += [at]
        archive_colour[at] = cols[ii]
    else:
        other_archives += [at]
        archive_colour[at] = cols[-1]
Now produce the summary figure by calling plot_count_proxy_by_archive_short():
uplt.plot_count_proxy_by_archive_short(df, archive_proxy_count, archive_proxy_ticks, archive_colour)
Next, plot the spatial distribution of all proxy records on a map:
#%% plot the spatial distribution of all records
proxy_lats = df['geo_meanLat'].values
proxy_lons = df['geo_meanLon'].values
# plot the map
fig = plt.figure(figsize=(15, 12), dpi=350)
grid = GS(1, 3)
ax = plt.subplot(grid[:, :], projection=ccrs.Robinson())  # create axis with a Robinson projection of the globe
ax.add_feature(cfeature.LAND, alpha=0.5)  # add land features
ax.add_feature(cfeature.OCEAN, alpha=0.6, facecolor='#C5DEEA')  # add ocean features
ax.coastlines()  # add coastlines
ax.set_global()
# loop through the data to generate a scatter plot of each data record:
# 1st loop: go through archive types individually (determines the marker type)
# 2nd loop: go through the proxy types of each archive (determines the colour)
mt = 'ov^s<>pP*XDdh'*10  # string of marker types
archive_types = major_archives + other_archives
# archive_types = [aa for aa in archive_types if aa!='other']
ijk = 0
for jj, at in enumerate(archive_types):
    arch_mask = df['archiveType']==at
    arch_proxy_types = np.unique(df['paleoData_proxy'][arch_mask])
    for ii, pt in enumerate(arch_proxy_types):
        pt_mask = df['paleoData_proxy']==pt
        at_mask = df['archiveType']==at
        label = at+': '+pt+' ($n=%d$)' % df['paleoData_proxy'][pt_mask & at_mask].count()
        marker = mt[ii] if at in major_archives else mt[ijk]
        plt.scatter(proxy_lons[pt_mask & at_mask], proxy_lats[pt_mask & at_mask],
                    transform=ccrs.PlateCarree(), zorder=999,
                    marker=marker, color=archive_colour[at],
                    label=label,
                    lw=.3, ec='k', s=200)
        if at not in major_archives:
            ijk += 1
plt.legend(bbox_to_anchor=(-0.01, -0.01), loc='upper left', ncol=3, fontsize=13.5, framealpha=0)
grid.tight_layout(fig)
utf.save_fig(fig, f'{df.name}_spatial_all', dir=df.name)
This produces a world map of all proxy records, coloured by archive type and marked by proxy type.
For further guidance see the interactive notebook.
Filter the dataframe
Under dod2k/notebooks you can find the notebook df_filter.ipynb. This notebook lets you filter the dataframe by a specific metadata field, e.g. for moisture-sensitive records, or for tree/TRW records, etc.
This notebook then saves the filtered dataframe as a compact dataframe under dod2k/data, from which it can be loaded by other notebooks (e.g. df_plot_dod2k.ipynb).
Start by loading the dataframe, then filter it, e.g.:
# filter for exclusively moisture-sensitive records (excluding combined temperature + moisture interpretations)
df_filter = df.loc[df['interpretation_variable']=='moisture']
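Other filters follow the same pattern; for example, selecting tree-ring-width records (both values appear in the column summaries above):
# keep only tree-ring width records
df_trw = df.loc[(df['archiveType']=='tree') & (df['paleoData_proxy']=='TRW')]
print('%d TRW records' % len(df_trw))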
The resulting dataframe can then be saved and used as input for df_info.ipynb or df_plot_dod2k.ipynb, or you can add this line to another notebook if you prefer.
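df_filter.ipynb writes the filtered dataframe in the compact CSV format under dod2k/data; for a quick personal workflow you can also use pandas' own pickle writer (the path below is illustrative):
# save the filtered dataframe for later reuse (illustrative path)
df_filter.to_pickle('data/dod2k_moisture_compact.pkl')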
For further guidance see the interactive notebook.