Filter compact dataframe¶
This notebook reads the compact dataframes and filters them for specific records (e.g. moisture-sensitive records). The filtered dataset is saved in a separate directory and can be loaded for further analysis, plotting, etc.
Author: Lucie Luecke
Date produced: 21/01/2025
Input: reads a dataframe with the following keys:
archiveType, dataSetName, datasetId, geo_meanElev, geo_meanLat, geo_meanLon, geo_siteName, interpretation_direction (new in v2.0), interpretation_variable, interpretation_variableDetail, interpretation_seasonality (new in v2.0), originalDataURL, originalDatabase, paleoData_notes, paleoData_proxy, paleoData_sensorSpecies, paleoData_units, paleoData_values, paleoData_variableName, year, yearUnits (optional: duplicateDetails)
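Before filtering, it can be useful to verify that a loaded dataframe actually carries these keys. A minimal sketch, assuming the key list above (the `missing_keys` helper and the toy dataframe are illustrative, not part of `dod2k_utilities`):

```python
import pandas as pd

# expected keys of a compact dataframe (copied from the list above)
EXPECTED_KEYS = [
    'archiveType', 'dataSetName', 'datasetId', 'geo_meanElev', 'geo_meanLat',
    'geo_meanLon', 'geo_siteName', 'interpretation_direction',
    'interpretation_variable', 'interpretation_variableDetail',
    'interpretation_seasonality', 'originalDataURL', 'originalDatabase',
    'paleoData_notes', 'paleoData_proxy', 'paleoData_sensorSpecies',
    'paleoData_units', 'paleoData_values', 'paleoData_variableName',
    'year', 'yearUnits',
]

def missing_keys(df):
    """Return the expected keys absent from df (extras such as the
    optional duplicateDetails column are ignored)."""
    return [k for k in EXPECTED_KEYS if k not in df.columns]

# toy dataframe with only two of the expected keys, for illustration
toy = pd.DataFrame({'archiveType': ['coral'], 'year': [[1900, 1901]]})
print(missing_keys(toy))
```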
Set up working environment¶
Make sure the repo_root is set correctly; it should be your_root_dir/dod2k. This should be the working directory throughout this notebook (and all other notebooks).
%load_ext autoreload
%autoreload 2
import sys
import os
from pathlib import Path
# Add parent directory to path (works from any notebook in notebooks/)
# the repo_root should be the parent directory of the notebooks folder
current_dir = Path().resolve()

# Determine repo root
if current_dir.name == 'dod2k':
    repo_root = current_dir
elif current_dir.parent.name == 'dod2k':
    repo_root = current_dir.parent
else:
    raise Exception('Please review the repo root structure (see first cell).')

# Update cwd and path only if needed
if os.getcwd() != str(repo_root):
    os.chdir(repo_root)
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

print(f"Repo root: {repo_root}")
if os.getcwd() == str(repo_root):
    print("Working directory matches repo root.")
Repo root: /home/skidush/PaleoCoLab/dod2k
Working directory matches repo root.
import pandas as pd
import numpy as np
from dod2k_utilities import ut_functions as utf # contains utility functions
Read dataframe¶
Read the compact dataframe.
{db_name} refers to the database, e.g.:
- database of databases:
- dod2k_v2.0 (dod2k: duplicate free, merged database)
- all_merged (NOT filtered for duplicates; simply the fusion of the input databases)
- original databases:
- fe23
- ch2k
- sisal
- pages2k
- iso2k
All compact dataframes are saved in {repo_root}/data/{db_name} as {db_name}_compact.csv.
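The naming convention above implies a predictable file location. A minimal sketch of how that path could be built, assuming the {repo_root}/data/{db_name}/{db_name}_compact.csv layout described above (`compact_csv_path` is a hypothetical helper, not part of `dod2k_utilities`):

```python
from pathlib import Path

def compact_csv_path(repo_root, db_name):
    """Build the expected location of a compact CSV, following the
    {repo_root}/data/{db_name}/{db_name}_compact.csv convention."""
    return Path(repo_root) / 'data' / db_name / f'{db_name}_compact.csv'

# example for the duplicate-free merged database
path = compact_csv_path('/home/user/dod2k', 'dod2k_v2.0')
print(path)  # /home/user/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact.csv
```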
db_name = 'dod2k_v2.0'
df = utf.load_compact_dataframe_from_csv(db_name)
print(df.originalDatabase.unique())
df.name = db_name
print(df.info())
['PAGES 2k v2.2.0' 'FE23 (Breitenmoser et al. (2014))' 'CoralHydro2k v1.0.1'
 'Iso2k v1.1.2' 'SISAL v3' 'dod2k_composite_z']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4781 entries, 0 to 4780
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   archiveType                    4781 non-null   object
 1   dataSetName                    4781 non-null   object
 2   datasetId                      4781 non-null   object
 3   duplicateDetails               4781 non-null   object
 4   geo_meanElev                   4699 non-null   float32
 5   geo_meanLat                    4781 non-null   float32
 6   geo_meanLon                    4781 non-null   float32
 7   geo_siteName                   4781 non-null   object
 8   interpretation_direction       4781 non-null   object
 9   interpretation_seasonality     4781 non-null   object
 10  interpretation_variable        4781 non-null   object
 11  interpretation_variableDetail  4781 non-null   object
 12  originalDataURL                4781 non-null   object
 13  originalDatabase               4781 non-null   object
 14  paleoData_notes                4781 non-null   object
 15  paleoData_proxy                4781 non-null   object
 16  paleoData_sensorSpecies        4781 non-null   object
 17  paleoData_units                4781 non-null   object
 18  paleoData_values               4781 non-null   object
 19  paleoData_variableName         4781 non-null   object
 20  year                           4781 non-null   object
 21  yearUnits                      4781 non-null   object
dtypes: float32(3), object(19)
memory usage: 765.8+ KB
None
Filter dataframe for specific record types¶
Here you can filter the dataframe for specific record types. Below is an example where we keep moisture-sensitive records (interpretation_variable equal to 'moisture' or 'temperature+moisture').
This can be done with any column and any value (e.g. for a specific archive type, etc.).
Further examples are commented out below for future use.
# if you want to filter for specific metadata, e.g. temperature or moisture records, run this:
# ---> interpretation_variable
# e.g.
# filter for >>moisture<< sensitive records (also includes records which are both moisture and temperature sensitive)
df_filter = df.loc[(df['interpretation_variable']=='moisture')|(df['interpretation_variable']=='temperature+moisture')]
df_filter.name = db_name + "_filtered_M_TM"
# # filter for >>exclusively moisture<< sensitive records only (without t+m)
# df_filter = df.loc[(df['interpretation_variable']=='moisture')]
# df_filter.name = db_name + "_filtered_M"
# # filter for >>temperature<< sensitive records only (also include records which are moisture and temperature sensitive)
# df_filter = df.loc[(df['interpretation_variable']=='temperature')|(df['interpretation_variable']=='temperature+moisture')]
# df_filter.name = db_name + "_filtered_T_TM"
# # filter for >>exclusively temperature<< sensitive records only (without t+m)
# df_filter = df.loc[(df['interpretation_variable']=='temperature')]
# df_filter.name = db_name + "_filtered_T"
# ---> archiveType and paleoData_proxy
# e.g.
# # filter for specific proxy type, e.g. archiveType='speleothem' and paleoData_proxy='d18O'
# df_filter = df.loc[(df['archiveType']=='speleothem')&(df['paleoData_proxy']=='d18O')]
# df_filter.name = db_name + "_filtered_speleo_d18O"
# # filter for specific proxy type, e.g. archiveType='speleothem' only
# df_filter = df.loc[(df['archiveType']=='speleothem')]
# df_filter.name = db_name + "_filtered_speleothem"
# ---> paleoData_proxy only
# e.g.
# df_filter = df.loc[(df['paleoData_proxy']=='MXD')]
# etc.
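The commented examples all follow the same pattern: select rows whose column values lie in an allowed set, then give the result a distinguishing name. A minimal sketch of a generic helper capturing that pattern (`filter_records` and the toy dataframe are illustrative, not part of `dod2k_utilities`):

```python
import pandas as pd

def filter_records(df, base_name, suffix, **criteria):
    """Keep rows where each given column's value is in the allowed list;
    name the result base_name + suffix so it is saved separately."""
    mask = pd.Series(True, index=df.index)
    for column, allowed in criteria.items():
        mask &= df[column].isin(allowed)
    out = df.loc[mask]
    out.name = base_name + suffix
    return out

# toy dataframe standing in for the compact database
toy = pd.DataFrame({
    'interpretation_variable': ['moisture', 'temperature', 'temperature+moisture'],
    'archiveType': ['speleothem', 'tree', 'coral'],
})
sub = filter_records(toy, 'dod2k_v2.0', '_filtered_M_TM',
                     interpretation_variable=['moisture', 'temperature+moisture'])
print(len(sub), sub.name)  # 2 dod2k_v2.0_filtered_M_TM
```

Note that `.name` is a plain attribute (as used throughout this notebook), so pandas may emit a UserWarning when setting it; it is lost whenever the dataframe is reassigned.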
IMPORTANT: the dataframe name needs to be adjusted according to the filtering.
Please add an identifier to the dataframe name, which will be used for displaying and saving the data.
Make sure it differs from the original db_name: since df_filter.name determines where the filtered data is saved, reusing db_name would overwrite the original data!
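This warning can also be enforced programmatically before saving. A minimal sketch (`checked_name` is a hypothetical helper, not part of the repo):

```python
def checked_name(filtered_name, db_name):
    """Raise if saving under filtered_name would overwrite the original data."""
    if filtered_name == db_name:
        raise ValueError(
            f"df_filter.name equals db_name ({db_name!r}); "
            "add a filter suffix to avoid overwriting the original data."
        )
    return filtered_name

print(checked_name('dod2k_v2.0_filtered_M_TM', 'dod2k_v2.0'))  # dod2k_v2.0_filtered_M_TM
```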
# the name attribute is not carried through filtering, so df_filter.name must be set explicitly (as above)
# check that it differs from the original db_name before saving
print(df_filter.name)
assert df_filter.name!=db_name
dod2k_v2.0_filtered_M_TM
Display the filtered dataframe.
print(df_filter.info())
<class 'pandas.core.frame.DataFrame'>
Index: 1416 entries, 260 to 4777
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   archiveType                    1416 non-null   object
 1   dataSetName                    1416 non-null   object
 2   datasetId                      1416 non-null   object
 3   duplicateDetails               1416 non-null   object
 4   geo_meanElev                   1398 non-null   float32
 5   geo_meanLat                    1416 non-null   float32
 6   geo_meanLon                    1416 non-null   float32
 7   geo_siteName                   1416 non-null   object
 8   interpretation_direction       1416 non-null   object
 9   interpretation_seasonality     1416 non-null   object
 10  interpretation_variable        1416 non-null   object
 11  interpretation_variableDetail  1416 non-null   object
 12  originalDataURL                1416 non-null   object
 13  originalDatabase               1416 non-null   object
 14  paleoData_notes                1416 non-null   object
 15  paleoData_proxy                1416 non-null   object
 16  paleoData_sensorSpecies        1416 non-null   object
 17  paleoData_units                1416 non-null   object
 18  paleoData_values               1416 non-null   object
 19  paleoData_variableName         1416 non-null   object
 20  year                           1416 non-null   object
 21  yearUnits                      1416 non-null   object
dtypes: float32(3), object(19)
memory usage: 237.8+ KB
None
Save filtered dataframe¶
Save the filtered dataframe in:
{repo_root}/data/{df_filter.name}
# create the output directory if it does not exist
path = os.path.join(os.getcwd(), 'data', df_filter.name)
os.makedirs(path, exist_ok=True)
# save as pickle
df_filter.to_pickle(f'data/{df_filter.name}/{df_filter.name}_compact.pkl')
# save csv
utf.write_compact_dataframe_to_csv(df_filter)
METADATA: datasetId, archiveType, dataSetName, duplicateDetails, geo_meanElev, geo_meanLat, geo_meanLon, geo_siteName, interpretation_direction, interpretation_seasonality, interpretation_variable, interpretation_variableDetail, originalDataURL, originalDatabase, paleoData_notes, paleoData_proxy, paleoData_sensorSpecies, paleoData_units, paleoData_variableName, yearUnits
Saved to /home/skidush/PaleoCoLab/dod2k/data/dod2k_v2.0_filtered_M_TM/dod2k_v2.0_filtered_M_TM_compact_%s.csv
# reload the saved dataframe to check the round trip
utf.load_compact_dataframe_from_csv(df_filter.name).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1416 entries, 0 to 1415
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   archiveType                    1416 non-null   object
 1   dataSetName                    1416 non-null   object
 2   datasetId                      1416 non-null   object
 3   duplicateDetails               1416 non-null   object
 4   geo_meanElev                   1398 non-null   float32
 5   geo_meanLat                    1416 non-null   float32
 6   geo_meanLon                    1416 non-null   float32
 7   geo_siteName                   1416 non-null   object
 8   interpretation_direction       1416 non-null   object
 9   interpretation_seasonality     1416 non-null   object
 10  interpretation_variable       1416 non-null   object
 11  interpretation_variableDetail  1416 non-null   object
 12  originalDataURL                1416 non-null   object
 13  originalDatabase               1416 non-null   object
 14  paleoData_notes                1416 non-null   object
 15  paleoData_proxy                1416 non-null   object
 16  paleoData_sensorSpecies        1416 non-null   object
 17  paleoData_units                1416 non-null   object
 18  paleoData_values               1416 non-null   object
 19  paleoData_variableName         1416 non-null   object
 20  year                           1416 non-null   object
 21  yearUnits                      1416 non-null   object
dtypes: float32(3), object(19)
memory usage: 226.9+ KB