Load the original databases and merge to a common dataframe.
This tutorial shows you how to load the original databases from source and merge them to a common dataframe (ready for duplicate detection).
Load the original databases from source
The original databases are each loaded via their individual load notebooks (`load_ds.ipynb`, where `ds` is the database name as listed below). These notebooks load each database from source and create a standardised 'compact' pandas dataframe with the following columns:
- `archiveType`
- `dataSetName`
- `datasetId`
- `geo_meanElev`
- `geo_meanLat`
- `geo_meanLon`
- `geo_siteName`
- `interpretation_direction` (new in v2.0)
- `interpretation_variable`
- `interpretation_variableDetail`
- `interpretation_seasonality` (new in v2.0)
- `originalDataURL`
- `originalDatabase`
- `paleoData_notes`
- `paleoData_proxy`
- `paleoData_sensorSpecies`
- `paleoData_units`
- `paleoData_values`
- `paleoData_variableName`
- `year`
- `yearUnits`
The intermediate data is saved in the `data` directory, where each database has its own subdirectory. For any database `db`, the output is saved in `data/db` under:
- `db_compact.pkl`: pickle file for easy Python processing; read with `pd.DataFrame(pd.read_pickle('db_compact.pkl'))`
- `db_compact_metadata.csv`: csv file which includes all the metadata associated with each record
- `db_compact_year.csv`: csv file which includes the time coordinates for each record
- `db_compact_paleoData_values.csv`: csv file which includes the actual record data for each record
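The pickle round-trip above can be sketched with a toy dataframe (the column subset and values here are illustrative only; the real compact dataframes carry all 21 columns listed above):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for a compact dataframe (illustrative column subset;
# the real compact dataframes carry all 21 columns listed above)
df = pd.DataFrame({
    'dataSetName': ['SiteA', 'SiteB'],
    'archiveType': ['coral', 'speleothem'],
    'paleoData_values': [[25.1, 25.3], [18.2, 18.0]],
    'year': [[1950.0, 1951.0], [1950.0, 1951.0]],
})

# Round-trip through a pickle, mirroring how db_compact.pkl is read back
path = os.path.join(tempfile.gettempdir(), 'demo_compact.pkl')
df.to_pickle(path)
df_back = pd.DataFrame(pd.read_pickle(path))

print(df_back.equals(df))  # True
```

Because `paleoData_values` and `year` hold whole time series per record (lists of varying length), the pickle is the most convenient format for Python processing, while the csv files split metadata and data for other tools.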
Further information about each individual load notebook can be found here:
Load PAGES 2k data from source
The interactive notebook to load PAGES 2k can be found here: load_pages2k.ipynb.
This notebook loads PAGES 2k data from LiPDverse, currently version 2.2.0, and creates a standardised 'compact dataframe'.
To load the data from source, run the following cell:
# Download the file
!wget -O data/pages2k/Pages2kTemperature2_2_0.pkl https://lipdverse.org/Pages2kTemperature/current_version/Pages2kTemperature2_2_0.pkl
This will download the pickle file Pages2kTemperature2_2_0.pkl into data/pages2k.
Subsequently, run the interactive notebook to create a set of csv files containing the compact dataframe.
Load FE23 data from source
The interactive notebook to load FE23 can be found here: load_fe23.ipynb.
This notebook loads FE23 from the NCEI and creates a standardised 'compact dataframe'.
To load the data from source, run the following cell:
# download FE23
!wget -O data/fe23/franke2022-fe23.nc https://www.ncei.noaa.gov/pub/data/paleo/contributions_by_author/franke2022/franke2022-fe23.nc

fe23_full = xr.open_dataset('data/fe23/franke2022-fe23.nc')

# save a slice of FE23 with only the relevant variables as netCDF (fe23_full is around 25GB)
fe23_slice = fe23_full[vars]  # vars: list of the relevant variable names, defined earlier in the notebook
fe23_slice.to_netcdf('data/fe23/franke2022-fe23_slice.nc')
This will download the netCDF file franke2022-fe23.nc into data/fe23. Note that this is a very large dataset (around 25GB), so it is useful to extract only the relevant variables from the full file and save that smaller slice instead.
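The variable-slicing step can be illustrated with a small in-memory dataset (the variable names here are made up; the real FE23 variables are selected in the notebook):

```python
import numpy as np
import xarray as xr

# Small in-memory stand-in for the ~25 GB FE23 dataset
# (variable names are hypothetical, for illustration only)
full = xr.Dataset(
    {
        'temperature': (('time',), np.arange(3.0)),
        'precipitation': (('time',), 2.0 * np.arange(3.0)),
        'unused_diagnostic': (('time',), np.zeros(3)),
    },
    coords={'time': [2000, 2001, 2002]},
)

# Indexing a Dataset with a list of variable names returns a new Dataset
# holding only those variables; the result can be written with .to_netcdf(...)
keep = ['temperature', 'precipitation']
small = full[keep]

print(list(small.data_vars))  # ['temperature', 'precipitation']
```

Writing the sliced dataset to disk means the 25GB file only needs to be opened once; all later processing works on the much smaller slice.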
Subsequently, run the interactive notebook to create a set of csv files containing the compact dataframe.
Load Iso2k data from source
The interactive notebook to load Iso2k can be found here: load_iso2k.ipynb.
This notebook loads Iso2k from the LiPDverse and creates a standardised 'compact dataframe'.
To load the data from source, run the following cell:
# Download the file (use -O to specify output filename)
!wget -O data/iso2k/iso2k1_1_2.zip https://lipdverse.org/iso2k/current_version/iso2k1_1_2.zip
# Unzip to the correct destination
!unzip data/iso2k/iso2k1_1_2.zip -d data/iso2k/iso2k1_1_2
This will download the zip file iso2k1_1_2.zip into data/iso2k and unzip it into the directory data/iso2k/iso2k1_1_2.
Subsequently, run the interactive notebook to create a set of csv files containing the compact dataframe.
Load SISAL data from source
The interactive notebook to load SISAL can be found here: load_sisal.ipynb.
This notebook loads SISAL from a set of csv files.
The current version (v3) was made available to the authors but has not yet been uploaded to the LiPDverse homepage. We will update the notebook and documentation once it is publicly available.
Subsequently, run the interactive notebook to create a set of csv files containing the compact dataframe.
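Loading a database from a set of csv files can be sketched as follows (the table and column names here are hypothetical stand-ins, not the actual SISAL schema):

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the SISAL csv tables (the real table and
# column names differ; this only sketches the loading pattern)
d = tempfile.mkdtemp()
pd.DataFrame({'site_id': [1], 'site_name': ['CaveA']}).to_csv(
    os.path.join(d, 'site.csv'), index=False)
pd.DataFrame({'site_id': [1, 1], 'd18O': [-5.2, -5.4]}).to_csv(
    os.path.join(d, 'sample.csv'), index=False)

# Read every csv in the directory into a dict of dataframes keyed by file name
tables = {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p)
          for p in sorted(glob.glob(os.path.join(d, '*.csv')))}

print(sorted(tables))  # ['sample', 'site']
```

The individual tables can then be joined on their shared key columns and reshaped into the compact dataframe format described above.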
Load CoralHydro2k data from source
The interactive notebook to load CoralHydro2k can be found here: load_ch2k.ipynb.
This notebook loads CoralHydro2k from the LiPDverse and creates a standardised 'compact dataframe'.
To load the data from source, run the following cell:
# Download the file (use -O to specify output filename)
!wget -O data/ch2k/CoralHydro2k1_0_1.zip https://lipdverse.org/CoralHydro2k/current_version/CoralHydro2k1_0_1.zip
# Unzip to the correct destination
!unzip data/ch2k/CoralHydro2k1_0_1.zip -d data/ch2k/ch2k_101
This will download the zip file CoralHydro2k1_0_1.zip into data/ch2k and unzip it into the directory data/ch2k/ch2k_101.
Subsequently, run the interactive notebook to create a set of csv files containing the compact dataframe.
Merge the databases to a common dataframe.
The individual databases can be merged into a common dataframe by loading all the compact dataframes created in the load notebooks (see previous section).
Simply run the notebook merge_databases.ipynb. This notebook loads the compact dataframes and concatenates them via
# read compact dataframes from all the single databases
dataset_names = ['pages2k', 'fe23', 'ch2k', 'iso2k', 'sisal']
print(dataset_names[0])
df = utf.load_compact_dataframe_from_csv(dataset_names[0])
print('length: ', len(df))
for dn in dataset_names[1:]:
    print(f'add {dn}')
    new_df = utf.load_compact_dataframe_from_csv(dn)
    df = pd.concat([df, new_df])
    print('length: ', len(df))
print('---------------')
print('RESULT:')
print(df.info())
which creates the following dataframe:
<class 'pandas.core.frame.DataFrame'>
Index: 5879 entries, 0 to 545
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 archiveType 5879 non-null object
1 dataSetName 5879 non-null object
2 datasetId 5879 non-null object
3 geo_meanElev 5780 non-null float32
4 geo_meanLat 5879 non-null float32
5 geo_meanLon 5879 non-null float32
6 geo_siteName 5879 non-null object
7 interpretation_direction 5879 non-null object
8 interpretation_seasonality 5879 non-null object
9 interpretation_variable 5879 non-null object
10 interpretation_variableDetail 5879 non-null object
11 originalDataURL 5879 non-null object
12 originalDatabase 5879 non-null object
13 paleoData_notes 5879 non-null object
14 paleoData_proxy 5879 non-null object
15 paleoData_sensorSpecies 5879 non-null object
16 paleoData_units 5879 non-null object
17 paleoData_values 5879 non-null object
18 paleoData_variableName 5879 non-null object
19 year 5879 non-null object
20 yearUnits 5879 non-null object
dtypes: float32(3), object(18)
memory usage: 941.6+ KB
None
This dataframe is saved in the directory data/all_merged, and can be used as input for duplicate detection.
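Note that plain `pd.concat` keeps each database's own RangeIndex, which is why the merged dataframe's index above runs from 0 to 545 rather than 0 to 5878. A minimal sketch with toy dataframes (not the real databases) shows the behaviour and how to obtain a unique index if needed:

```python
import pandas as pd

# Two toy compact dataframes, each with its own default RangeIndex
a = pd.DataFrame({'dataSetName': ['A1', 'A2'], 'originalDatabase': ['pages2k'] * 2})
b = pd.DataFrame({'dataSetName': ['B1'], 'originalDatabase': ['sisal']})

merged = pd.concat([a, b])
print(merged.index.tolist())  # [0, 1, 0] -- indices repeat per database

# ignore_index=True (or .reset_index(drop=True)) yields a unique index
merged_flat = pd.concat([a, b], ignore_index=True)
print(merged_flat.index.tolist())  # [0, 1, 2]
```

Whether the repeated per-database indices matter depends on the downstream duplicate-detection code; the `originalDatabase` column always identifies which source each row came from.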