Duplicate detection - step 3: remove true duplicates¶
This notebook runs the third part of the duplicate detection algorithm on a dataframe with the following columns:
- archiveType (used for duplicate detection algorithm)
- dataSetName
- datasetId
- geo_meanElev (used for duplicate detection algorithm)
- geo_meanLat (used for duplicate detection algorithm)
- geo_meanLon (used for duplicate detection algorithm)
- geo_siteName (used for duplicate detection algorithm)
- interpretation_direction
- interpretation_seasonality
- interpretation_variable
- interpretation_variableDetails
- originalDataURL
- originalDatabase
- paleoData_notes
- paleoData_proxy (used for duplicate detection algorithm)
- paleoData_units
- paleoData_values (used for duplicate detection algorithm: test for correlation, RMSE, correlation of 1st difference, RMSE of 1st difference)
- paleoData_variableName
- year (used for duplicate detection algorithm)
- yearUnits

This interactive notebook (dup_removal.ipynb) removes the duplicates flagged in dup_detection.ipynb, following the decisions made in dup_decision.ipynb. The decisions include:
- removal of redundant duplicates
- creation of composites
These decisions are based on the operator choices recorded in data/DATABASENAME/duplicate_detection/duplicate_decisions_DATABASENAME_AUTHORINITIALS_YY-MM-DD.csv.
Ultimately, a duplicate-free dataframe is saved under:
- data/DATABASENAME/DATABASENAME_dupfree.pkl
- data/DATABASENAME/DATABASENAME_dupfree_data.csv
- data/DATABASENAME/DATABASENAME_dupfree_year.csv
- data/DATABASENAME/DATABASENAME_dupfree_metadata.csv
10/11/2025 by LL: tidied up with revised data organisation and prepared for documentation
02/12/2024 by LL: modified the compositing process for metadata to fix bugs and make it more user friendly; added some extra information to the bottom of the file (prior to the figures)
22/10/2024 by LL: added the composite option for duplicates (create z-scores and average over the shared time period)
30/09/2024 by LL: keep all original database values for removed duplicates with more than one original database
Author: Lucie Luecke, created 27/9/2024
Initialisation¶
Set up working environment¶
Make sure the repo_root is set correctly; it should be your_root_dir/dod2k. This should be the working directory throughout this notebook (and all other notebooks).
%load_ext autoreload
%autoreload 2
import sys
import os
from pathlib import Path
# Add parent directory to path (works from any notebook in notebooks/)
# the repo_root should be the parent directory of the notebooks folder
current_dir = Path().resolve()
# Determine repo root
if current_dir.name == 'dod2k':
    repo_root = current_dir
elif current_dir.parent.name == 'dod2k':
    repo_root = current_dir.parent
else:
    raise Exception('Please review the repo root structure (see first cell).')
# Update cwd and path only if needed
if os.getcwd() != str(repo_root):
os.chdir(repo_root)
if str(repo_root) not in sys.path:
sys.path.insert(0, str(repo_root))
print(f"Repo root: {repo_root}")
if os.getcwd() == str(repo_root):
    print("Working directory matches repo root.")
Repo root: /home/jupyter-lluecke/dod2k Working directory matches repo root.
import pandas as pd
import numpy as np
import datetime
from dod2k_utilities import ut_functions as utf # contains utility functions
from dod2k_utilities import ut_duplicate_search as dup # contains utility functions
Load dataset¶
Define the dataset which needs to be screened for duplicates. Input files for the duplicate detection mechanism need to be compact dataframes (pandas dataframes with standardised columns and entry formatting).
The function load_compact_dataframe_from_csv loads the dataframe from a csv file in data/DB/, with DB the name of the database. The database name (db_name) can be one of
- pages2k
- ch2k
- iso2k
- sisal
- fe23

for the individual databases, or
- all_merged

to load the merged database of all individual databases; alternatively, it can be any user-defined compact dataframe.
# load dataframe
db_name='all_merged'
# db_name='dup_test'
df = utf.load_compact_dataframe_from_csv(db_name)
print(df.info())
df.name = db_name
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5147 entries, 0 to 5146 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 5147 non-null object 1 dataSetName 5147 non-null object 2 datasetId 5147 non-null object 3 geo_meanElev 5048 non-null float32 4 geo_meanLat 5147 non-null float32 5 geo_meanLon 5147 non-null float32 6 geo_siteName 5147 non-null object 7 interpretation_direction 5147 non-null object 8 interpretation_seasonality 5147 non-null object 9 interpretation_variable 5147 non-null object 10 interpretation_variableDetail 5147 non-null object 11 originalDataURL 5147 non-null object 12 originalDatabase 5147 non-null object 13 paleoData_notes 5147 non-null object 14 paleoData_proxy 5147 non-null object 15 paleoData_sensorSpecies 5147 non-null object 16 paleoData_units 5147 non-null object 17 paleoData_values 5147 non-null object 18 paleoData_variableName 5147 non-null object 19 year 5147 non-null object 20 yearUnits 5147 non-null object dtypes: float32(3), object(18) memory usage: 784.2+ KB None
Set datasetId as dataframe index to reliably identify the duplicates:
df.set_index('datasetId', inplace = True)
df['datasetId']=df.index
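For illustration, a toy example of this indexing pattern (the records here are hypothetical, not part of the database):

```python
import pandas as pd

# Toy compact dataframe with hypothetical records
df_toy = pd.DataFrame({
    'datasetId': ['pages2k_0', 'iso2k_296'],
    'geo_siteName': ['Site A', 'Site B'],
})

# Use datasetId as the index, but keep it available as a column too
df_toy.set_index('datasetId', inplace=True)
df_toy['datasetId'] = df_toy.index

# Records can now be selected reliably by ID rather than by position
print(df_toy.loc['iso2k_296', 'geo_siteName'])  # -> Site B
```

Keeping datasetId both as index and as a column means ID-based lookups work while the ID still survives in column-wise exports.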
Input operator's credentials¶
To ensure maximum transparency and reproducibility, enter the operator's credentials here.
These details are used to flag the intermediate output files and are provided along with the final duplicate-free dataset.
# initials = 'LL'
# fullname = 'Lucie Luecke'
# email = 'ljluec1@st-andrews.ac.uk'
initials = 'MNE'
fullname = 'Michael Evans'
email = 'mnevans@umd.edu'
operator_details = [initials, fullname, email]
Apply duplicate decisions to dataframe¶
Load duplicate decisions from csv¶
Please specify the date of the decision process below. The decision output file is then loaded from data/DBNAME/dup_detection/dup_decisions_DBNAME_INITIALS_DATE.csv.
# date = str(datetime.datetime.utcnow())[2:10]
# date='25-12-11' # for Lucie's dup_test decisions
date='25-12-19' # for Mike's latest all_merged decisions
# date='25-12-11' # for Lucie's latest all_merged decisions
filename = f'data/{df.name}/dup_detection/dup_decisions_{df.name}_{initials}_{date}'
data, header = dup.read_csv(filename, header=True)
df_decisions = pd.read_csv(filename+'.csv', header=5)
for hh in header:
print(hh)
print(df_decisions.columns)
print(df.name)
Decisions for duplicate candidate pairs.
Operated by Michael Evans (MNE)
E-Mail: mnevans@umd.edu
Created on: 2025-12-19 23:14:43.155509 (UTC)
updated for PAGES2k v2.2.0, Iso2k v1.1.0, CoralHydro2k v1.0.1 all from https://lipdverse.org/ . Checked: Cobb et al (2003, 2013); Dee et al (2020) composite record for Palmyra is updated (https://lipdverse.org/data/XHtmwaqI1qgs2CQxrXgK/1_0_8//CO03COPM.jsonld) in Pages2k v2.2.0 lipdverse compilation.
Index(['index 1', 'index 2', 'figure path', 'datasetId 1', 'datasetId 2',
'originalDatabase 1', 'originalDatabase 2', 'geo_siteName 1',
'geo_siteName 2', 'geo_meanLat 1', 'geo_meanLat 2', 'geo_meanLon 1',
'geo_meanLon 2', 'geo_meanElevation 1', 'geo_meanElevation 2',
'archiveType 1', 'archiveType 2', 'paleoData_proxy 1',
'paleoData_proxy 2', 'originalDataURL 1', 'originalDataURL 2', 'year 1',
'year 2', 'Decision 1', 'Decision 2', 'Decision type',
'Decision comment'],
dtype='object')
all_merged
Collect the decisions associated with each candidate ID (not as a pair) to identify records associated with multiple decisions.
# Collect decisions for each record
decisions = dup.collect_record_decisions(df_decisions)
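The grouping performed here can be illustrated with a toy decision table. This is only a sketch with hypothetical IDs, not the actual implementation in ut_duplicate_search:

```python
import pandas as pd
from collections import defaultdict

def collect_decisions_sketch(df_decisions):
    """Collect the decisions attached to each record ID across all pairs."""
    decisions = defaultdict(list)
    for _, row in df_decisions.iterrows():
        decisions[row['datasetId 1']].append(row['Decision 1'])
        decisions[row['datasetId 2']].append(row['Decision 2'])
    return dict(decisions)

# Toy decision table in which pages2k_0 appears in two candidate pairs
toy = pd.DataFrame({
    'datasetId 1': ['pages2k_0', 'pages2k_0'],
    'Decision 1':  ['REMOVE', 'REMOVE'],
    'datasetId 2': ['iso2k_296', 'ch2k_AB01XYZ01_0'],
    'Decision 2':  ['KEEP', 'KEEP'],
})
print(collect_decisions_sketch(toy))
# pages2k_0 collects one decision per pair it appears in: ['REMOVE', 'REMOVE']
```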
Show the decisions associated with each individual record:
for rr, dd in decisions.items():
print(rr, dd)
pages2k_0 ['REMOVE', 'REMOVE', 'REMOVE'] iso2k_296 ['KEEP', 'KEEP', 'KEEP'] iso2k_298 ['KEEP', 'REMOVE', 'KEEP'] iso2k_299 ['KEEP', 'REMOVE', 'REMOVE'] pages2k_6 ['REMOVE'] FE23_northamerica_usa_az555 ['KEEP'] pages2k_50 ['REMOVE'] FE23_northamerica_canada_cana091 ['KEEP'] pages2k_62 ['KEEP'] pages2k_63 ['REMOVE'] pages2k_81 ['REMOVE', 'REMOVE'] ch2k_HE08LRA01_76 ['KEEP', 'REMOVE'] iso2k_1813 ['KEEP', 'KEEP'] pages2k_83 ['REMOVE'] iso2k_1916 ['KEEP'] pages2k_85 ['KEEP'] pages2k_88 ['REMOVE'] pages2k_94 ['REMOVE'] FE23_northamerica_canada_cana153 ['KEEP'] pages2k_107 ['REMOVE'] FE23_northamerica_usa_ak046 ['KEEP'] pages2k_121 ['KEEP'] pages2k_122 ['REMOVE'] pages2k_132 ['REMOVE'] FE23_northamerica_canada_cana225 ['KEEP'] pages2k_158 ['REMOVE'] FE23_northamerica_usa_wa069 ['KEEP'] pages2k_171 ['REMOVE'] FE23_northamerica_usa_wy021 ['KEEP'] pages2k_203 ['REMOVE'] iso2k_826 ['KEEP'] pages2k_225 ['REMOVE'] FE23_northamerica_usa_nv512 ['KEEP', 'REMOVE'] pages2k_238 ['REMOVE'] iso2k_1044 ['KEEP'] pages2k_242 ['REMOVE', 'REMOVE'] ch2k_LI06FIJ01_582 ['KEEP', 'REMOVE'] iso2k_353 ['KEEP', 'KEEP'] pages2k_258 ['REMOVE'] iso2k_1498 ['KEEP'] pages2k_263 ['REMOVE'] iso2k_1322 ['KEEP'] pages2k_267 ['KEEP', 'REMOVE'] iso2k_58 ['REMOVE', 'REMOVE'] iso2k_1068 ['KEEP', 'REMOVE'] pages2k_271 ['REMOVE', 'REMOVE'] ch2k_FE18RUS01_492 ['KEEP', 'REMOVE'] iso2k_1861 ['KEEP', 'KEEP'] pages2k_273 ['REMOVE'] FE23_asia_russ130w ['KEEP'] pages2k_281 ['REMOVE'] FE23_northamerica_canada_cana155 ['KEEP'] pages2k_294 ['REMOVE'] FE23_northamerica_usa_ak021 ['KEEP'] pages2k_305 ['REMOVE'] pages2k_309 ['KEEP'] pages2k_307 ['REMOVE'] pages2k_311 ['KEEP'] pages2k_315 ['REMOVE'] iso2k_362 ['KEEP'] pages2k_317 ['REMOVE', 'REMOVE'] ch2k_NA09MAL01_84 ['KEEP', 'REMOVE'] iso2k_1754 ['KEEP', 'KEEP'] pages2k_323 ['REMOVE'] FE23_northamerica_canada_cana210 ['KEEP'] pages2k_385 ['REMOVE', 'REMOVE'] ch2k_FE09OGA01_304 ['KEEP', 'REMOVE'] iso2k_1922 ['KEEP', 'KEEP'] pages2k_387 ['REMOVE'] ch2k_FE09OGA01_306 ['KEEP'] 
pages2k_395 ['REMOVE', 'REMOVE'] ch2k_CA07FLI01_400 ['KEEP', 'REMOVE'] iso2k_1057 ['KEEP', 'KEEP'] pages2k_397 ['REMOVE'] ch2k_CA07FLI01_402 ['KEEP'] pages2k_409 ['REMOVE', 'REMOVE'] ch2k_QU96ESV01_422 ['KEEP', 'REMOVE'] iso2k_218 ['KEEP', 'KEEP'] pages2k_414 ['KEEP'] pages2k_418 ['REMOVE'] pages2k_417 ['KEEP'] pages2k_421 ['REMOVE'] pages2k_427 ['KEEP'] pages2k_433 ['REMOVE'] pages2k_435 ['KEEP'] pages2k_842 ['REMOVE'] pages2k_444 ['KEEP', 'KEEP'] pages2k_445 ['REMOVE', 'KEEP'] pages2k_446 ['REMOVE', 'REMOVE'] pages2k_462 ['REMOVE', 'REMOVE'] ch2k_OS14UCP01_236 ['KEEP', 'REMOVE'] iso2k_350 ['KEEP', 'KEEP'] pages2k_468 ['KEEP', 'REMOVE'] pages2k_3550 ['REMOVE', 'REMOVE'] FE23_asia_russ137w ['KEEP', 'KEEP'] pages2k_472 ['KEEP', 'KEEP'] pages2k_474 ['REMOVE', 'REMOVE'] pages2k_477 ['REMOVE', 'REMOVE'] pages2k_478 ['REMOVE'] iso2k_1846 ['KEEP'] pages2k_486 ['REMOVE'] FE23_northamerica_usa_ca609 ['KEEP'] pages2k_495 ['REMOVE', 'REMOVE'] ch2k_LI06RAR01_12 ['KEEP', 'REMOVE'] iso2k_1502 ['KEEP', 'KEEP'] pages2k_500 ['REMOVE', 'REMOVE'] ch2k_AS05GUA01_302 ['KEEP', 'REMOVE'] iso2k_1559 ['KEEP', 'KEEP'] pages2k_541 ['REMOVE'] iso2k_404 ['KEEP'] pages2k_543 ['KEEP'] pages2k_976 ['REMOVE'] pages2k_565 ['REMOVE'] iso2k_998 ['KEEP'] pages2k_583 ['REMOVE'] FE23_northamerica_usa_mt116 ['KEEP'] pages2k_592 ['REMOVE', 'REMOVE'] ch2k_LI06RAR02_270 ['KEEP', 'REMOVE'] iso2k_1500 ['KEEP', 'KEEP'] pages2k_610 ['REMOVE'] iso2k_1199 ['KEEP'] pages2k_626 ['REMOVE'] FE23_northamerica_usa_wa071 ['KEEP'] pages2k_691 ['REMOVE'] FE23_northamerica_canada_cana062 ['KEEP'] pages2k_730 ['REMOVE'] iso2k_396 ['KEEP'] pages2k_736 ['REMOVE'] FE23_northamerica_usa_wy024 ['KEEP'] pages2k_800 ['REMOVE'] FE23_northamerica_canada_cana234 ['KEEP'] pages2k_818 ['REMOVE'] iso2k_488 ['KEEP'] pages2k_827 ['KEEP'] pages2k_830 ['REMOVE'] pages2k_831 ['KEEP', 'REMOVE'] pages2k_2220 ['REMOVE', 'REMOVE'] FE23_asia_russ127w ['KEEP', 'KEEP'] pages2k_857 ['REMOVE'] FE23_northamerica_usa_ut511 ['KEEP'] pages2k_881 
['REMOVE'] iso2k_1010 ['KEEP', 'KEEP'] pages2k_893 ['KEEP', 'KEEP'] pages2k_895 ['REMOVE', 'KEEP'] pages2k_900 ['REMOVE', 'REMOVE'] pages2k_940 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_DR99ABR01_264 ['KEEP', 'KEEP', 'REMOVE'] ch2k_DR99ABR01_266 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_91 ['KEEP', 'KEEP', 'KEEP'] pages2k_945 ['REMOVE'] iso2k_100 ['KEEP'] pages2k_960 ['REMOVE'] iso2k_641 ['KEEP'] pages2k_982 ['REMOVE'] FE23_northamerica_usa_or042 ['KEEP'] pages2k_1004 ['REMOVE'] iso2k_644 ['KEEP'] pages2k_1026 ['REMOVE'] FE23_northamerica_usa_az553 ['KEEP'] pages2k_1048 ['REMOVE'] iso2k_1212 ['KEEP'] pages2k_1089 ['REMOVE', 'REMOVE'] FE23_northamerica_usa_mt112 ['KEEP', 'KEEP'] FE23_northamerica_usa_mt113 ['KEEP', 'REMOVE'] pages2k_1108 ['REMOVE'] iso2k_1060 ['KEEP'] pages2k_1116 ['REMOVE'] FE23_northamerica_canada_cana170w ['KEEP'] pages2k_1147 ['KEEP', 'KEEP', 'REMOVE'] ch2k_DA06MAF01_78 ['REMOVE', 'REMOVE'] ch2k_DA06MAF02_104 ['REMOVE', 'REMOVE'] iso2k_1748 ['KEEP', 'KEEP', 'KEEP'] pages2k_1153 ['KEEP', 'KEEP'] pages2k_1156 ['REMOVE', 'KEEP'] pages2k_1160 ['REMOVE', 'REMOVE'] pages2k_1209 ['REMOVE'] FE23_northamerica_usa_co553 ['KEEP'] pages2k_1252 ['REMOVE'] FE23_northamerica_canada_cana096 ['KEEP'] pages2k_1274 ['REMOVE'] iso2k_1577 ['KEEP'] pages2k_1293 ['REMOVE'] iso2k_821 ['KEEP'] pages2k_1325 ['REMOVE'] FE23_northamerica_usa_wy030 ['KEEP'] pages2k_1360 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_UR00MAI01_22 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_94 ['KEEP', 'KEEP', 'KEEP'] iso2k_98 ['KEEP', 'KEEP', 'REMOVE'] pages2k_1362 ['KEEP'] pages2k_1365 ['REMOVE'] pages2k_1370 ['REMOVE'] iso2k_1619 ['KEEP'] pages2k_1420 ['REMOVE'] FE23_northamerica_canada_cana111 ['KEEP'] pages2k_1442 ['KEEP'] pages2k_1444 ['REMOVE'] pages2k_1488 ['REMOVE', 'REMOVE', 'REMOVE', 'REMOVE'] pages2k_1628 ['KEEP', 'KEEP', 'KEEP', 'KEEP'] ch2k_NU11PAL01_52 ['KEEP', 'REMOVE', 'REMOVE', 'REMOVE'] iso2k_505 ['REMOVE', 'REMOVE', 'REMOVE', 'REMOVE'] iso2k_579 ['KEEP', 'REMOVE', 'KEEP', 'REMOVE'] pages2k_1490 ['REMOVE'] 
ch2k_NU11PAL01_54 ['KEEP'] pages2k_1491 ['REMOVE'] iso2k_575 ['KEEP'] pages2k_1497 ['REMOVE'] iso2k_1885 ['KEEP'] pages2k_1515 ['KEEP'] pages2k_1519 ['REMOVE'] pages2k_1520 ['REMOVE'] pages2k_1522 ['KEEP'] pages2k_1547 ['REMOVE'] iso2k_259 ['KEEP'] pages2k_1566 ['REMOVE'] FE23_northamerica_canada_cana231 ['KEEP'] pages2k_1605 ['REMOVE'] FE23_northamerica_usa_ca606 ['KEEP'] pages2k_1619 ['KEEP'] pages2k_1623 ['REMOVE'] pages2k_1636 ['REMOVE'] FE23_northamerica_usa_wa081 ['KEEP'] pages2k_1686 ['KEEP'] pages2k_1688 ['REMOVE'] pages2k_1692 ['REMOVE'] FE23_asia_mong012 ['KEEP'] pages2k_1703 ['REMOVE', 'REMOVE'] ch2k_MO06PED01_226 ['KEEP', 'REMOVE'] iso2k_629 ['KEEP', 'KEEP'] pages2k_1712 ['REMOVE'] iso2k_715 ['KEEP'] pages2k_1720 ['REMOVE'] iso2k_1579 ['KEEP'] pages2k_1741 ['REMOVE'] FE23_northamerica_usa_wa104 ['KEEP'] pages2k_1750 ['REMOVE', 'REMOVE'] iso2k_1856 ['KEEP', 'KEEP'] sisal_294.0_194 ['KEEP', 'REMOVE'] pages2k_1771 ['REMOVE'] ch2k_TU01LAI01_192 ['KEEP'] pages2k_1804 ['REMOVE'] FE23_northamerica_usa_me010 ['KEEP'] pages2k_1859 ['REMOVE', 'REMOVE'] ch2k_HE10GUA01_244 ['KEEP', 'REMOVE'] iso2k_1735 ['KEEP', 'KEEP'] pages2k_1861 ['REMOVE'] ch2k_HE10GUA01_246 ['KEEP'] pages2k_1880 ['REMOVE'] FE23_northamerica_usa_ak060 ['KEEP'] pages2k_1891 ['REMOVE'] pages2k_1893 ['KEEP'] pages2k_1918 ['REMOVE'] iso2k_102 ['KEEP'] pages2k_1920 ['KEEP'] pages2k_1923 ['REMOVE'] pages2k_1932 ['REMOVE'] pages2k_1934 ['KEEP'] pages2k_1942 ['REMOVE', 'REMOVE'] ch2k_ZI04IFR01_26 ['KEEP', 'REMOVE'] iso2k_257 ['KEEP', 'KEEP'] pages2k_1972 ['KEEP'] pages2k_1973 ['REMOVE'] pages2k_1976 ['KEEP'] pages2k_1980 ['REMOVE'] pages2k_1978 ['KEEP'] pages2k_1983 ['REMOVE'] pages2k_1985 ['REMOVE'] iso2k_1294 ['KEEP'] pages2k_1989 ['KEEP'] pages2k_1991 ['REMOVE'] pages2k_1994 ['REMOVE'] ch2k_DE12ANC01_258 ['KEEP'] pages2k_2013 ['REMOVE'] FE23_northamerica_canada_cana097 ['KEEP'] pages2k_2042 ['REMOVE', 'REMOVE'] ch2k_TU95MAD01_24 ['KEEP', 'REMOVE'] iso2k_20 ['KEEP', 'KEEP'] pages2k_2059 ['REMOVE'] 
FE23_northamerica_usa_ak058 ['KEEP'] pages2k_2085 ['REMOVE'] FE23_northamerica_canada_cana002 ['KEEP'] pages2k_2094 ['REMOVE', 'REMOVE'] ch2k_TU01DEP01_450 ['KEEP', 'REMOVE'] iso2k_1201 ['KEEP', 'KEEP'] pages2k_2098 ['KEEP'] pages2k_2103 ['REMOVE'] pages2k_2110 ['REMOVE'] FE23_northamerica_usa_co554 ['KEEP'] pages2k_2146 ['KEEP', 'KEEP'] pages2k_2149 ['REMOVE', 'KEEP'] pages2k_2150 ['REMOVE', 'REMOVE'] pages2k_2156 ['REMOVE'] FE23_northamerica_canada_cana169w ['KEEP'] pages2k_2214 ['REMOVE'] iso2k_1631 ['KEEP'] pages2k_2226 ['REMOVE'] FE23_asia_mong007w ['KEEP'] pages2k_2265 ['REMOVE'] FE23_northamerica_usa_ak070 ['KEEP'] pages2k_2287 ['KEEP'] pages2k_2290 ['REMOVE'] pages2k_2300 ['REMOVE'] ch2k_OS14RIP01_174 ['KEEP'] pages2k_2303 ['REMOVE'] FE23_asia_mong006 ['KEEP'] pages2k_2309 ['REMOVE'] ch2k_WE09ARR01_208 ['KEEP'] pages2k_2311 ['REMOVE'] ch2k_WE09ARR01_210 ['KEEP'] pages2k_2319 ['REMOVE'] FE23_northamerica_usa_ak6 ['KEEP'] pages2k_2339 ['KEEP'] pages2k_2344 ['REMOVE'] pages2k_2361 ['REMOVE'] FE23_northamerica_usa_wa097 ['KEEP'] pages2k_2402 ['REMOVE'] FE23_northamerica_usa_co586 ['KEEP'] pages2k_2430 ['REMOVE'] FE23_northamerica_canada_cana113 ['KEEP'] pages2k_2473 ['REMOVE'] FE23_northamerica_usa_wy022 ['KEEP'] pages2k_2500 ['KEEP'] pages2k_2502 ['REMOVE'] pages2k_2510 ['REMOVE'] iso2k_1626 ['KEEP'] pages2k_2514 ['REMOVE'] iso2k_1467 ['KEEP'] pages2k_2517 ['REMOVE'] iso2k_1130 ['KEEP'] pages2k_2534 ['REMOVE'] iso2k_1575 ['KEEP'] pages2k_2538 ['REMOVE'] iso2k_1862 ['KEEP'] pages2k_2561 ['REMOVE'] FE23_northamerica_canada_cana094 ['KEEP'] pages2k_2592 ['KEEP'] pages2k_2596 ['REMOVE'] pages2k_2595 ['REMOVE'] pages2k_2599 ['REMOVE'] pages2k_2604 ['KEEP', 'REMOVE'] pages2k_2606 ['REMOVE', 'REMOVE'] iso2k_1481 ['KEEP', 'KEEP'] pages2k_2607 ['KEEP', 'KEEP'] pages2k_2609 ['REMOVE', 'KEEP'] pages2k_2612 ['REMOVE', 'REMOVE'] pages2k_2613 ['REMOVE'] iso2k_1470 ['KEEP'] pages2k_2617 ['REMOVE'] iso2k_1573 ['KEEP'] pages2k_2634 ['REMOVE'] FE23_northamerica_usa_id013 
['KEEP'] pages2k_2660 ['REMOVE'] FE23_northamerica_usa_ak014 ['KEEP'] pages2k_2677 ['REMOVE'] FE23_northamerica_usa_wy023 ['KEEP'] pages2k_2703 ['REMOVE'] FE23_northamerica_usa_ak094 ['KEEP'] pages2k_2722 ['REMOVE'] FE23_northamerica_canada_cana238 ['KEEP'] pages2k_2750 ['REMOVE'] iso2k_1708 ['KEEP'] pages2k_2752 ['KEEP', 'KEEP'] pages2k_2755 ['REMOVE', 'KEEP'] pages2k_2759 ['REMOVE', 'REMOVE'] pages2k_2793 ['KEEP'] pages2k_2795 ['REMOVE', 'KEEP'] pages2k_2798 ['REMOVE', 'REMOVE'] pages2k_2796 ['KEEP'] pages2k_2830 ['REMOVE'] FE23_northamerica_mexico_mexi020 ['KEEP'] pages2k_2843 ['REMOVE'] FE23_northamerica_usa_wa083 ['KEEP'] pages2k_2899 ['KEEP'] pages2k_2901 ['REMOVE'] pages2k_2904 ['KEEP'] pages2k_2906 ['REMOVE'] pages2k_2922 ['REMOVE'] FE23_northamerica_usa_ca603 ['KEEP'] pages2k_2953 ['REMOVE'] iso2k_573 ['KEEP'] pages2k_2959 ['REMOVE'] FE23_northamerica_mexico_mexi043 ['KEEP'] pages2k_2976 ['REMOVE'] FE23_northamerica_usa_id008 ['KEEP'] pages2k_3002 ['REMOVE'] FE23_northamerica_usa_or043 ['KEEP'] pages2k_3028 ['KEEP', 'KEEP'] pages2k_3030 ['REMOVE', 'KEEP'] pages2k_3033 ['REMOVE', 'REMOVE'] pages2k_3038 ['REMOVE'] FE23_northamerica_usa_mt108 ['KEEP'] pages2k_3064 ['REMOVE'] iso2k_698 ['KEEP'] pages2k_3068 ['REMOVE', 'REMOVE'] ch2k_ZI14IFR02_522 ['KEEP', 'KEEP'] ch2k_ZI14IFR02_524 ['KEEP', 'REMOVE'] pages2k_3085 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_KU00NIN01_150 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_1554 ['KEEP', 'KEEP', 'REMOVE'] iso2k_1556 ['KEEP', 'KEEP', 'KEEP'] pages2k_3107 ['REMOVE'] FE23_northamerica_usa_co552 ['KEEP', 'KEEP'] pages2k_3108 ['REMOVE'] pages2k_3132 ['REMOVE', 'REMOVE'] ch2k_QU06RAB01_144 ['KEEP', 'REMOVE'] iso2k_1311 ['KEEP', 'KEEP'] pages2k_3134 ['REMOVE'] ch2k_QU06RAB01_146 ['KEEP'] pages2k_3170 ['REMOVE'] FE23_australia_newz062 ['KEEP', 'KEEP'] pages2k_3179 ['REMOVE'] FE23_northamerica_usa_ak057 ['KEEP'] pages2k_3188 ['KEEP'] pages2k_3191 ['REMOVE'] pages2k_3196 ['REMOVE'] FE23_asia_mong011 ['KEEP'] pages2k_3202 ['REMOVE'] iso2k_1727 
['KEEP'] pages2k_3234 ['KEEP', 'KEEP'] pages2k_3236 ['REMOVE', 'KEEP'] pages2k_3239 ['REMOVE', 'REMOVE'] pages2k_3243 ['REMOVE'] iso2k_0 ['KEEP'] pages2k_3263 ['REMOVE'] iso2k_1264 ['KEEP'] pages2k_3266 ['REMOVE', 'REMOVE'] ch2k_GO12SBV01_396 ['KEEP', 'REMOVE'] iso2k_870 ['KEEP', 'KEEP'] pages2k_3307 ['REMOVE'] iso2k_339 ['KEEP'] pages2k_3313 ['REMOVE'] FE23_northamerica_usa_ca560 ['KEEP'] pages2k_3337 ['KEEP'] pages2k_3342 ['REMOVE'] pages2k_3352 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_ZI14TUR01_480 ['KEEP', 'KEEP', 'REMOVE'] ch2k_ZI14TUR01_482 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_302 ['KEEP', 'KEEP', 'KEEP'] pages2k_3372 ['REMOVE', 'REMOVE'] ch2k_KI04MCV01_366 ['KEEP', 'REMOVE'] iso2k_155 ['KEEP', 'KEEP'] pages2k_3374 ['REMOVE'] ch2k_KI04MCV01_368 ['KEEP'] pages2k_3404 ['REMOVE'] FE23_northamerica_canada_cana029 ['KEEP'] pages2k_3417 ['KEEP'] pages2k_3419 ['REMOVE'] pages2k_3503 ['REMOVE'] FE23_northamerica_usa_wa072 ['KEEP'] pages2k_3524 ['REMOVE'] FE23_northamerica_usa_ak010 ['KEEP'] pages2k_3552 ['REMOVE'] iso2k_1581 ['KEEP'] pages2k_3554 ['REMOVE', 'REMOVE'] ch2k_LI94SEC01_436 ['KEEP', 'REMOVE'] iso2k_1124 ['KEEP', 'KEEP'] pages2k_3571 ['REMOVE'] iso2k_174 ['KEEP'] pages2k_3583 ['REMOVE'] FE23_northamerica_usa_co633 ['KEEP'] pages2k_3599 ['REMOVE', 'KEEP'] iso2k_1069 ['KEEP', 'REMOVE'] iso2k_1660 ['KEEP', 'KEEP'] pages2k_3609 ['REMOVE'] FE23_northamerica_canada_cana053 ['KEEP'] pages2k_3631 ['REMOVE'] iso2k_1530 ['KEEP'] pages2k_3642 ['REMOVE'] FE23_northamerica_usa_wy025 ['KEEP'] FE23_southamerica_arge016 ['KEEP'] FE23_southamerica_arge085 ['REMOVE'] FE23_northamerica_canada_cana100 ['REMOVE'] FE23_northamerica_canada_cana213 ['KEEP'] FE23_northamerica_canada_cana105 ['REMOVE'] FE23_northamerica_canada_cana217 ['KEEP'] FE23_northamerica_canada_cana116 ['KEEP'] FE23_northamerica_canada_cana168w ['REMOVE'] FE23_northamerica_canada_cana161 ['KEEP'] FE23_northamerica_canada_cana162 ['REMOVE'] FE23_southamerica_chil016 ['KEEP'] FE23_southamerica_chil017 ['REMOVE'] 
FE23_europe_swed019w ['COMPOSITE'] FE23_europe_swed021w ['COMPOSITE'] FE23_northamerica_mexico_mexi022 ['REMOVE'] FE23_northamerica_mexico_mexi023 ['KEEP'] FE23_australia_newz003 ['REMOVE'] FE23_australia_newz060 ['KEEP'] FE23_australia_newz008 ['REMOVE'] FE23_australia_newz092 ['KEEP'] FE23_australia_newz014 ['REMOVE'] FE23_australia_newz061 ['KEEP'] FE23_australia_newz018 ['REMOVE'] FE23_australia_newz019 ['REMOVE'] FE23_australia_newz063 ['KEEP'] FE23_northamerica_usa_ca066 ['REMOVE'] FE23_northamerica_usa_ca628 ['KEEP', 'KEEP'] FE23_northamerica_usa_ca067 ['REMOVE'] FE23_northamerica_usa_ca512 ['REMOVE'] FE23_northamerica_usa_ca613 ['KEEP'] FE23_northamerica_usa_ca535 ['REMOVE'] FE23_northamerica_usa_ca670 ['KEEP'] FE23_northamerica_usa_me017 ['REMOVE'] FE23_northamerica_usa_me018 ['KEEP'] FE23_northamerica_usa_mo ['KEEP'] FE23_northamerica_usa_mo009 ['REMOVE'] FE23_northamerica_usa_nj001 ['KEEP'] FE23_northamerica_usa_nj002 ['REMOVE'] FE23_northamerica_usa_nm024 ['KEEP'] FE23_northamerica_usa_nm055 ['REMOVE'] FE23_northamerica_usa_nv060 ['REMOVE'] FE23_northamerica_usa_nv518 ['KEEP'] FE23_northamerica_usa_nv521 ['KEEP'] FE23_northamerica_usa_nv513 ['REMOVE'] FE23_northamerica_usa_nv520 ['KEEP'] ch2k_ZI15MER01_2 ['KEEP'] ch2k_ZI15MER01_4 ['REMOVE'] ch2k_CO03PAL03_6 ['REMOVE'] iso2k_511 ['KEEP'] ch2k_CO03PAL02_8 ['REMOVE'] iso2k_509 ['KEEP'] ch2k_CO03PAL07_14 ['REMOVE'] iso2k_521 ['KEEP'] ch2k_RE18CAY01_30 ['REMOVE'] iso2k_917 ['KEEP'] ch2k_KU99HOU01_40 ['REMOVE', 'REMOVE'] iso2k_786 ['KEEP', 'REMOVE'] iso2k_788 ['KEEP', 'KEEP'] ch2k_CA14TIM01_64 ['REMOVE'] iso2k_473 ['KEEP'] ch2k_SW98STP01_86 ['REMOVE'] iso2k_50 ['KEEP'] ch2k_CO03PAL01_110 ['REMOVE'] iso2k_507 ['KEEP'] ch2k_CH98PIR01_116 ['REMOVE'] iso2k_1229 ['KEEP'] ch2k_XI17HAI01_128 ['KEEP', 'REMOVE'] ch2k_XI17HAI01_136 ['REMOVE', 'REMOVE'] iso2k_1762 ['KEEP', 'KEEP'] ch2k_XI17HAI01_130 ['KEEP'] ch2k_XI17HAI01_134 ['REMOVE'] ch2k_DE14DTO03_140 ['REMOVE'] ch2k_DE14DTO01_148 ['REMOVE'] ch2k_EV18ROC01_184 
['KEEP'] ch2k_EV18ROC01_186 ['REMOVE'] ch2k_CA13SAP01_188 ['REMOVE'] iso2k_569 ['KEEP'] ch2k_HE13MIS01_194 ['REMOVE', 'REMOVE'] iso2k_211 ['REMOVE'] iso2k_213 ['REMOVE'] ch2k_ZI15IMP02_200 ['KEEP'] ch2k_ZI15IMP02_202 ['REMOVE'] ch2k_PF04PBA01_204 ['REMOVE', 'REMOVE'] iso2k_1701 ['KEEP', 'KEEP'] iso2k_1704 ['KEEP', 'REMOVE'] ch2k_CO03PAL05_212 ['REMOVE'] iso2k_515 ['KEEP'] ch2k_ZI15TAN01_278 ['KEEP'] ch2k_ZI15TAN01_280 ['REMOVE'] ch2k_GU99NAU01_314 ['KEEP', 'REMOVE'] iso2k_702 ['REMOVE', 'REMOVE'] iso2k_705 ['KEEP', 'KEEP'] ch2k_CO03PAL10_324 ['REMOVE'] iso2k_519 ['KEEP'] ch2k_ZI15IMP01_328 ['KEEP'] ch2k_ZI15IMP01_330 ['REMOVE'] ch2k_RO19YUC01_338 ['KEEP'] ch2k_RO19YUC01_340 ['REMOVE'] ch2k_CO03PAL09_358 ['REMOVE'] iso2k_525 ['KEEP'] ch2k_BA04FIJ02_382 ['REMOVE'] iso2k_52 ['KEEP'] ch2k_CO03PAL06_386 ['REMOVE'] iso2k_517 ['KEEP'] ch2k_CO93TAR01_408 ['REMOVE'] iso2k_539 ['KEEP'] ch2k_CO00MAL01_412 ['REMOVE'] ch2k_DE13HAI01_424 ['KEEP', 'REMOVE'] ch2k_DE13HAI01_432 ['REMOVE', 'REMOVE'] iso2k_1643 ['KEEP', 'KEEP'] ch2k_DE13HAI01_426 ['KEEP'] ch2k_DE13HAI01_430 ['REMOVE'] ch2k_ZI15CLE01_438 ['KEEP'] ch2k_ZI15CLE01_440 ['REMOVE'] ch2k_CO03PAL04_452 ['REMOVE'] iso2k_513 ['KEEP'] ch2k_FL18DTO01_460 ['REMOVE'] ch2k_FL18DTO02_554 ['REMOVE'] ch2k_DU94URV01_468 ['KEEP'] ch2k_DU94URV01_470 ['REMOVE'] ch2k_CO03PAL08_472 ['REMOVE'] iso2k_523 ['KEEP'] ch2k_LI99CLI01_486 ['REMOVE'] iso2k_1571 ['KEEP'] ch2k_ZI15BUN01_488 ['KEEP'] ch2k_ZI15BUN01_490 ['REMOVE'] ch2k_WU13TON01_504 ['KEEP'] ch2k_WU13TON01_506 ['REMOVE'] ch2k_KI14PAR01_510 ['KEEP'] ch2k_KI14PAR01_518 ['REMOVE'] ch2k_KI14PAR01_512 ['KEEP'] ch2k_KI14PAR01_516 ['REMOVE'] ch2k_BA04FIJ01_558 ['REMOVE'] iso2k_55 ['KEEP'] iso2k_120 ['KEEP'] sisal_253.0_171 ['REMOVE'] iso2k_140 ['KEEP'] sisal_278.0_184 ['REMOVE'] iso2k_236 ['KEEP'] sisal_205.0_141 ['REMOVE'] iso2k_380 ['KEEP'] sisal_446.0_292 ['REMOVE'] iso2k_399 ['KEEP', 'KEEP'] iso2k_806 ['REMOVE', 'KEEP'] iso2k_811 ['REMOVE', 'REMOVE'] iso2k_533 ['REMOVE'] sisal_115.0_69 
['KEEP'] iso2k_546 ['KEEP'] iso2k_549 ['REMOVE'] iso2k_547 ['KEEP'] iso2k_550 ['REMOVE'] iso2k_772 ['KEEP'] iso2k_775 ['REMOVE'] iso2k_873 ['KEEP'] sisal_471.0_314 ['REMOVE'] iso2k_1107 ['KEEP', 'KEEP'] iso2k_1817 ['REMOVE', 'KEEP'] sisal_271.0_174 ['REMOVE', 'REMOVE'] iso2k_1178 ['KEEP'] sisal_201.0_133 ['REMOVE'] iso2k_1283 ['KEEP'] iso2k_1286 ['REMOVE'] iso2k_1288 ['KEEP'] sisal_329.0_213 ['REMOVE'] iso2k_1291 ['KEEP'] sisal_330.0_215 ['REMOVE'] iso2k_1495 ['KEEP'] sisal_305.0_199 ['REMOVE'] iso2k_1504 ['REMOVE'] sisal_113.0_66 ['KEEP'] iso2k_1820 ['KEEP'] sisal_272.0_177 ['REMOVE'] iso2k_1823 ['KEEP'] sisal_273.0_179 ['REMOVE'] iso2k_1848 ['KEEP'] iso2k_1855 ['REMOVE'] iso2k_1850 ['KEEP'] iso2k_1851 ['REMOVE'] sisal_46.0_18 ['KEEP'] sisal_47.0_21 ['REMOVE'] sisal_46.0_19 ['KEEP'] sisal_47.0_22 ['REMOVE'] sisal_46.0_20 ['KEEP'] sisal_47.0_23 ['REMOVE'] sisal_430.0_270 ['KEEP'] sisal_896.0_531 ['REMOVE'] sisal_430.0_271 ['KEEP'] sisal_896.0_533 ['REMOVE']
Save all the duplicate details in one dictionary, which will be used in the duplicate-free dataframe (final output) df_dupfree to provide details on the duplicate detection process (duplicateDetails).
# Collect duplicate details for each record
dup_details = dup.collect_dup_details(df_decisions, header)
Note that any one record can appear more than once and have multiple decisions associated with it (e.g. 'REMOVE', 'KEEP' or 'COMPOSITE').

To remove the duplicates, we need to implement the following steps:
- Records to be REMOVED: remove all records from the dataframe associated with the decision 'REMOVE' and save the result in df_cleaned.
- Records to be COMPOSITED: create composites of these records and save them in df_composite.
- Check for records which have both 'REMOVE' and 'COMPOSITE' decisions associated with them. These are potentially remaining duplicates; the operator is once again asked to make decisions and run a 'mini' version of the duplicate workflow.
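The conflict check in the last step can be sketched as follows (toy per-record decision lists with hypothetical IDs):

```python
# Toy per-record decision lists; rec_c carries conflicting decisions
decisions = {
    'rec_a': ['REMOVE', 'REMOVE'],
    'rec_b': ['COMPOSITE', 'COMPOSITE'],
    'rec_c': ['REMOVE', 'COMPOSITE'],  # conflicting -> needs operator review
}

# Flag records associated with both 'REMOVE' and 'COMPOSITE'
conflicting = [rid for rid, dd in decisions.items()
               if 'REMOVE' in dd and 'COMPOSITE' in dd]
print(conflicting)  # -> ['rec_c']
```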
1. Records to be REMOVED¶
First, simply remove all the records to which the decision 'REMOVE' and/or 'COMPOSITE' applies and store the result in df_cleaned; the removed records are stored in df_duplica (for later inspection).
# load the records TO BE REMOVED OR COMPOSITED
remove_IDs = list(df_decisions['datasetId 1'][np.isin(df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
remove_IDs += list(df_decisions['datasetId 2'][np.isin(df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
remove_IDs = np.unique(remove_IDs)
df_duplica = df.loc[remove_IDs, 'datasetId'] # records flagged for removal or compositing
df_cleaned = df.drop(remove_IDs) # df freed from 'REMOVE'/'COMPOSITE' type records
print(f'Removed {len(df_duplica)} REMOVE or COMPOSITE type records.')
print(f'REMOVE type duplicate free dataset contains {len(df_cleaned)} records.')
print('Removed the following IDs:', remove_IDs)
print(df.name)
Removed 367 REMOVE or COMPOSITE type records. REMOVE type duplicate free dataset contains 4780 records. Removed the following IDs: ['FE23_australia_newz003' 'FE23_australia_newz008' 'FE23_australia_newz014' 'FE23_australia_newz018' 'FE23_australia_newz019' 'FE23_europe_swed019w' 'FE23_europe_swed021w' 'FE23_northamerica_canada_cana100' 'FE23_northamerica_canada_cana105' 'FE23_northamerica_canada_cana162' 'FE23_northamerica_canada_cana168w' 'FE23_northamerica_mexico_mexi022' 'FE23_northamerica_usa_ca066' 'FE23_northamerica_usa_ca067' 'FE23_northamerica_usa_ca512' 'FE23_northamerica_usa_ca535' 'FE23_northamerica_usa_me017' 'FE23_northamerica_usa_mo009' 'FE23_northamerica_usa_mt113' 'FE23_northamerica_usa_nj002' 'FE23_northamerica_usa_nm055' 'FE23_northamerica_usa_nv060' 'FE23_northamerica_usa_nv512' 'FE23_northamerica_usa_nv513' 'FE23_southamerica_arge085' 'FE23_southamerica_chil017' 'ch2k_AS05GUA01_302' 'ch2k_BA04FIJ01_558' 'ch2k_BA04FIJ02_382' 'ch2k_CA07FLI01_400' 'ch2k_CA13SAP01_188' 'ch2k_CA14TIM01_64' 'ch2k_CH98PIR01_116' 'ch2k_CO00MAL01_412' 'ch2k_CO03PAL01_110' 'ch2k_CO03PAL02_8' 'ch2k_CO03PAL03_6' 'ch2k_CO03PAL04_452' 'ch2k_CO03PAL05_212' 'ch2k_CO03PAL06_386' 'ch2k_CO03PAL07_14' 'ch2k_CO03PAL08_472' 'ch2k_CO03PAL09_358' 'ch2k_CO03PAL10_324' 'ch2k_CO93TAR01_408' 'ch2k_DA06MAF01_78' 'ch2k_DA06MAF02_104' 'ch2k_DE13HAI01_424' 'ch2k_DE13HAI01_430' 'ch2k_DE13HAI01_432' 'ch2k_DE14DTO01_148' 'ch2k_DE14DTO03_140' 'ch2k_DR99ABR01_264' 'ch2k_DR99ABR01_266' 'ch2k_DU94URV01_470' 'ch2k_EV18ROC01_186' 'ch2k_FE09OGA01_304' 'ch2k_FE18RUS01_492' 'ch2k_FL18DTO01_460' 'ch2k_FL18DTO02_554' 'ch2k_GO12SBV01_396' 'ch2k_GU99NAU01_314' 'ch2k_HE08LRA01_76' 'ch2k_HE10GUA01_244' 'ch2k_HE13MIS01_194' 'ch2k_KI04MCV01_366' 'ch2k_KI14PAR01_516' 'ch2k_KI14PAR01_518' 'ch2k_KU00NIN01_150' 'ch2k_KU99HOU01_40' 'ch2k_LI06FIJ01_582' 'ch2k_LI06RAR01_12' 'ch2k_LI06RAR02_270' 'ch2k_LI94SEC01_436' 'ch2k_LI99CLI01_486' 'ch2k_MO06PED01_226' 'ch2k_NA09MAL01_84' 'ch2k_NU11PAL01_52' 'ch2k_OS14UCP01_236' 
'ch2k_PF04PBA01_204' 'ch2k_QU06RAB01_144' 'ch2k_QU96ESV01_422' 'ch2k_RE18CAY01_30' 'ch2k_RO19YUC01_340' 'ch2k_SW98STP01_86' 'ch2k_TU01DEP01_450' 'ch2k_TU95MAD01_24' 'ch2k_UR00MAI01_22' 'ch2k_WU13TON01_506' 'ch2k_XI17HAI01_128' 'ch2k_XI17HAI01_134' 'ch2k_XI17HAI01_136' 'ch2k_ZI04IFR01_26' 'ch2k_ZI14IFR02_524' 'ch2k_ZI14TUR01_480' 'ch2k_ZI14TUR01_482' 'ch2k_ZI15BUN01_490' 'ch2k_ZI15CLE01_440' 'ch2k_ZI15IMP01_330' 'ch2k_ZI15IMP02_202' 'ch2k_ZI15MER01_4' 'ch2k_ZI15TAN01_280' 'iso2k_1068' 'iso2k_1069' 'iso2k_1286' 'iso2k_1504' 'iso2k_1554' 'iso2k_1704' 'iso2k_1817' 'iso2k_1851' 'iso2k_1855' 'iso2k_211' 'iso2k_213' 'iso2k_298' 'iso2k_299' 'iso2k_505' 'iso2k_533' 'iso2k_549' 'iso2k_550' 'iso2k_579' 'iso2k_58' 'iso2k_702' 'iso2k_775' 'iso2k_786' 'iso2k_806' 'iso2k_811' 'iso2k_98' 'pages2k_0' 'pages2k_1004' 'pages2k_1026' 'pages2k_1048' 'pages2k_107' 'pages2k_1089' 'pages2k_1108' 'pages2k_1116' 'pages2k_1147' 'pages2k_1156' 'pages2k_1160' 'pages2k_1209' 'pages2k_122' 'pages2k_1252' 'pages2k_1274' 'pages2k_1293' 'pages2k_132' 'pages2k_1325' 'pages2k_1360' 'pages2k_1365' 'pages2k_1370' 'pages2k_1420' 'pages2k_1444' 'pages2k_1488' 'pages2k_1490' 'pages2k_1491' 'pages2k_1497' 'pages2k_1519' 'pages2k_1520' 'pages2k_1547' 'pages2k_1566' 'pages2k_158' 'pages2k_1605' 'pages2k_1623' 'pages2k_1636' 'pages2k_1688' 'pages2k_1692' 'pages2k_1703' 'pages2k_171' 'pages2k_1712' 'pages2k_1720' 'pages2k_1741' 'pages2k_1750' 'pages2k_1771' 'pages2k_1804' 'pages2k_1859' 'pages2k_1861' 'pages2k_1880' 'pages2k_1891' 'pages2k_1918' 'pages2k_1923' 'pages2k_1932' 'pages2k_1942' 'pages2k_1973' 'pages2k_1980' 'pages2k_1983' 'pages2k_1985' 'pages2k_1991' 'pages2k_1994' 'pages2k_2013' 'pages2k_203' 'pages2k_2042' 'pages2k_2059' 'pages2k_2085' 'pages2k_2094' 'pages2k_2103' 'pages2k_2110' 'pages2k_2149' 'pages2k_2150' 'pages2k_2156' 'pages2k_2214' 'pages2k_2220' 'pages2k_2226' 'pages2k_225' 'pages2k_2265' 'pages2k_2290' 'pages2k_2300' 'pages2k_2303' 'pages2k_2309' 'pages2k_2311' 'pages2k_2319' 
'pages2k_2344' 'pages2k_2361' 'pages2k_238' 'pages2k_2402' 'pages2k_242' 'pages2k_2430' 'pages2k_2473' 'pages2k_2502' 'pages2k_2510' 'pages2k_2514' 'pages2k_2517' 'pages2k_2534' 'pages2k_2538' 'pages2k_2561' 'pages2k_258' 'pages2k_2595' 'pages2k_2596' 'pages2k_2599' 'pages2k_2604' 'pages2k_2606' 'pages2k_2609' 'pages2k_2612' 'pages2k_2613' 'pages2k_2617' 'pages2k_263' 'pages2k_2634' 'pages2k_2660' 'pages2k_267' 'pages2k_2677' 'pages2k_2703' 'pages2k_271' 'pages2k_2722' 'pages2k_273' 'pages2k_2750' 'pages2k_2755' 'pages2k_2759' 'pages2k_2795' 'pages2k_2798' 'pages2k_281' 'pages2k_2830' 'pages2k_2843' 'pages2k_2901' 'pages2k_2906' 'pages2k_2922' 'pages2k_294' 'pages2k_2953' 'pages2k_2959' 'pages2k_2976' 'pages2k_3002' 'pages2k_3030' 'pages2k_3033' 'pages2k_3038' 'pages2k_305' 'pages2k_3064' 'pages2k_3068' 'pages2k_307' 'pages2k_3085' 'pages2k_3107' 'pages2k_3108' 'pages2k_3132' 'pages2k_3134' 'pages2k_315' 'pages2k_317' 'pages2k_3170' 'pages2k_3179' 'pages2k_3191' 'pages2k_3196' 'pages2k_3202' 'pages2k_323' 'pages2k_3236' 'pages2k_3239' 'pages2k_3243' 'pages2k_3263' 'pages2k_3266' 'pages2k_3307' 'pages2k_3313' 'pages2k_3342' 'pages2k_3352' 'pages2k_3372' 'pages2k_3374' 'pages2k_3404' 'pages2k_3419' 'pages2k_3503' 'pages2k_3524' 'pages2k_3550' 'pages2k_3552' 'pages2k_3554' 'pages2k_3571' 'pages2k_3583' 'pages2k_3599' 'pages2k_3609' 'pages2k_3631' 'pages2k_3642' 'pages2k_385' 'pages2k_387' 'pages2k_395' 'pages2k_397' 'pages2k_409' 'pages2k_418' 'pages2k_421' 'pages2k_433' 'pages2k_445' 'pages2k_446' 'pages2k_462' 'pages2k_468' 'pages2k_474' 'pages2k_477' 'pages2k_478' 'pages2k_486' 'pages2k_495' 'pages2k_50' 'pages2k_500' 'pages2k_541' 'pages2k_565' 'pages2k_583' 'pages2k_592' 'pages2k_6' 'pages2k_610' 'pages2k_626' 'pages2k_63' 'pages2k_691' 'pages2k_730' 'pages2k_736' 'pages2k_800' 'pages2k_81' 'pages2k_818' 'pages2k_83' 'pages2k_830' 'pages2k_831' 'pages2k_842' 'pages2k_857' 'pages2k_88' 'pages2k_881' 'pages2k_895' 'pages2k_900' 'pages2k_94' 'pages2k_940' 
'pages2k_945' 'pages2k_960' 'pages2k_976' 'pages2k_982' 'sisal_201.0_133' 'sisal_205.0_141' 'sisal_253.0_171' 'sisal_271.0_174' 'sisal_272.0_177' 'sisal_273.0_179' 'sisal_278.0_184' 'sisal_294.0_194' 'sisal_305.0_199' 'sisal_329.0_213' 'sisal_330.0_215' 'sisal_446.0_292' 'sisal_47.0_21' 'sisal_47.0_22' 'sisal_47.0_23' 'sisal_471.0_314' 'sisal_896.0_531' 'sisal_896.0_533'] all_merged
# add columns on decision process to df_cleaned:
df_cleaned['duplicateDetails']='N/A'
for ID in dup_details:
if ID in df_cleaned.index:
if df_cleaned.at[ID, 'duplicateDetails']=='N/A':
df_cleaned.at[ID, 'duplicateDetails']=dup_details[ID]
else: df_cleaned.at[ID, 'duplicateDetails']+=dup_details[ID]
# df_cleaned[df_cleaned[ 'duplicateDetails']!='N/A'].at['ch2k_DE14DTO03_140', 'duplicateDetails']
2. Records to be COMPOSITED¶
Now identify all the records to which the decision 'COMPOSITE' applies, create composites and store them in df_composite.
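The changelog notes that composites are built by creating z-scores and averaging over the shared time period. A minimal sketch of that idea, for illustration only: the notebook itself delegates to dup.join_composites_metadata, and the function name composite_zscore below is invented here.

```python
import numpy as np

def composite_zscore(year1, vals1, year2, vals2):
    """Composite two records: z-score each series over its full length,
    then average wherever one or both series has a value.
    Illustrative sketch only, not the project's implementation."""
    z = lambda v: (np.asarray(v, float) - np.nanmean(v)) / np.nanstd(v)
    s1 = dict(zip(year1, z(vals1)))
    s2 = dict(zip(year2, z(vals2)))
    # union of the two time axes; average over whichever series covers each year
    years = sorted(set(year1) | set(year2))
    comp = [np.mean([d[y] for d in (s1, s2) if y in d]) for y in years]
    return np.array(years), np.array(comp)
```

In the overlap the composite is the mean of the two z-scored series; outside it, the composite simply carries the single available (z-scored) series.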
# add the column 'duplicateDetails' to df, in case it does not exist
if 'duplicateDetails' not in df.columns: df['duplicateDetails']='N/A'
# load the records to be composited
comp_ID_pairs = df_decisions[(df_decisions['Decision 1']=='COMPOSITE')&(df_decisions['Decision 2']=='COMPOSITE')]
# create new composite data and metadata from the pairs
# loop through the composite pairs and check metadata
df_composite = dup.join_composites_metadata(df, comp_ID_pairs, df_decisions, header)
FE23_europe_swed019w FE23_europe_swed021w -------------------------------------------------------------------------------- Metadata different for >>>geo_siteName<<< in: FE23_europe_swed019w (Torneträskr+f.,Bartoli) and FE23_europe_swed021w (Torneträskfos.,Bartoli). -------------------------------------------------------------------------------- Metadata different for >>>interpretation_variable<<< in: FE23_europe_swed019w (NOT temperature NOT moisture) and FE23_europe_swed021w (N/A).
saved figure in /home/jupyter-lluecke/dod2k/figs//all_merged/dup_detection//composite_FE23_europe_swed019w_FE23_europe_swed021w.pdf
print(df_composite.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1 entries, 0 to 0 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 1 non-null object 1 geo_meanElev 1 non-null float32 2 geo_meanLat 1 non-null float32 3 geo_meanLon 1 non-null float32 4 geo_siteName 1 non-null object 5 paleoData_proxy 1 non-null object 6 yearUnits 1 non-null object 7 interpretation_variable 1 non-null object 8 interpretation_direction 1 non-null object 9 interpretation_seasonality 1 non-null object 10 paleoData_values 1 non-null object 11 year 1 non-null object 12 dataSetName 1 non-null object 13 originalDatabase 1 non-null object 14 originalDataURL 1 non-null object 15 paleoData_notes 1 non-null object 16 interpretation_variableDetail 1 non-null object 17 datasetId 1 non-null object 18 paleoData_units 1 non-null object 19 duplicateDetails 1 non-null object dtypes: float32(3), object(17) memory usage: 280.0+ bytes None
3. Check for overlap between duplicates¶
The duplicate free dataframe is obtained by joining
- df_cleaned (duplicate free, as all records with decision REMOVE and/or COMPOSITE have been removed) and
- df_composite (duplicate free, as duplicates are composited).
However, there might still be duplicates between the two dataframes when a record has been associated with more than one duplicate candidate pair.
The scenarios for duplicates appearing twice:
1. REMOVE/KEEP and COMPOSITE:

- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> COMPOSITE, c -> COMPOSITE.

In this case, b and ac (the composite record of a and c) would be duplicates in the merged dataframe.

2a. REMOVE/KEEP & KEEP/REMOVE:

- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> KEEP, c -> REMOVE.

In this case a would still be removed, as REMOVE overrides KEEP in the algorithm. So only b is kept and no duplicates remain.

2b. REMOVE/KEEP & REMOVE/KEEP:

- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> REMOVE, c -> KEEP.

In this case a would be removed, but b and c are kept and would be duplicates in the merged dataframe.

3. COMPOSITE x 2:

- duplicate pair a and b have had the decisions assigned: a -> COMPOSITE, b -> COMPOSITE
- duplicate pair a and c have had the decisions assigned: a -> COMPOSITE, c -> COMPOSITE.

In this case, ab and ac would be duplicates in the merged dataframe.

4. REMOVE/KEEP & KEEP/KEEP:

- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> KEEP, c -> KEEP.

In this case a would be removed and b and c are kept, but as c is not a duplicate of a, no duplicates remain.
Therefore, we loop through the records in the joined dataframe which have been associated with multiple duplicates.
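To make the bookkeeping concrete, here is a toy illustration (hypothetical IDs a, b, c; the notebook builds the same structure from the decisions CSV) of how a record that appears in more than one pair gets flagged:

```python
# each tuple: (id1, decision1, id2, decision2) for one duplicate pair
pairs = [('a', 'REMOVE', 'b', 'KEEP'),
         ('a', 'COMPOSITE', 'c', 'COMPOSITE')]

# collect all decisions per record ID across pairs
decisions = {}
for id1, dec1, id2, dec2 in pairs:
    for rec_id, dec in ((id1, dec1), (id2, dec2)):
        decisions.setdefault(rec_id, []).append(dec)

# records with more than one decision need the iterative re-check below
multiple_dups = [rec_id for rec_id, decs in decisions.items() if len(decs) > 1]
# 'a' was decided twice (REMOVE and COMPOSITE), so it is flagged
```

Record a carries both a REMOVE and a COMPOSITE decision, which is exactly the situation the loop below resolves by re-running duplicate detection on the flagged records.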
Merge the composites and the dataframe freed from REMOVE and COMPOSITE type records
# initialise the loop: start from the cleaned records plus the new composites
tmp_df_dupfree = pd.concat([df_cleaned, df_composite])
tmp_df_dupfree.index = tmp_df_dupfree['datasetId']
tmp_decisions = decisions.copy()
Create a loop which implements a mini duplicate detection on all the records which have multiple decisions associated with them.
# Simple composite tracking for debugging only
composite_log = []
for ii in range(10):
tmp_df_dupfree.set_index('datasetId', inplace = True)
tmp_df_dupfree['datasetId']=tmp_df_dupfree.index
print('-'*20)
print(f'ITERATION # {ii}')
multiple_dups = []
for id in tmp_decisions.keys():
if len(tmp_decisions[id]) > 1:
if id not in multiple_dups:
multiple_dups.append(id)
if len(multiple_dups) > 0:
# Check which of the multiple duplicate IDs are still in the dataframe
multiple_dups_new = []
current_ids = set(tmp_df_dupfree.index) # Get all current IDs as a set
for id in multiple_dups:
if id in current_ids: # Simple membership check
multiple_dups_new.append(id)
if len(multiple_dups_new) > 0:
print(f'WARNING! Decisions associated with {len(multiple_dups_new)} multiple duplicates in the new dataframe.')
print('Please review these records below and run through a further duplicate detection workflow until no more duplicates are found.')
else:
print('No more multiple duplicates found in current dataframe.')
print('SUCCESS!!')
break
else:
print('No more multiple duplicates.')
print('SUCCESS!!')
break
# Now we create a small dataframe which needs to be checked for duplicates.
df_check = tmp_df_dupfree.copy()[np.isin(tmp_df_dupfree['datasetId'], multiple_dups_new)]
print('Check dataframe: ')
df_check.name = 'tmp'
df_check.index = range(len(df_check))
print(df_check.info())
# We then run a brief duplicate detection algorithm on the dataframe. Note that by default the composited data has the highest value in the hierarchy.
pot_dup_IDs = dup.find_duplicates_optimized(df_check, n_points_thresh=10, return_data=True)
if len(pot_dup_IDs)==0:
print('SUCCESS!! NO MORE DUPLICATES DETECTED!!')
break
else:
yn=''
while yn not in ['y', 'n']:
yn = input('Do you want to continue with the decision process for duplicates? [y/n]')
if yn=='n': break
df_check = dup.define_hierarchy(df_check)
dup.duplicate_decisions_multiple(df_check, operator_details=operator_details, choose_recollection=True,
remove_identicals=False, backup=False, comment=False)
# implement the decisions
tmp_df_decisions = pd.read_csv(f'data/{df_check.name}/dup_detection/dup_decisions_{df_check.name}_{initials}_{date}'+'.csv', header=5)
tmp_dup_details = dup.provide_dup_details(tmp_df_decisions, header)
# decisions
tmp_decisions = {}
for ind in tmp_df_decisions.index:
id1, id2 = tmp_df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
dec1, dec2 = tmp_df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
for id, dec in zip([id1, id2], [dec1, dec2]):
if id not in tmp_decisions: tmp_decisions[id] = []
tmp_decisions[id]+=[dec]
df_check.set_index('datasetId', inplace = True)
df_check['datasetId']=df_check.index
#drop all REMOVE or COMPOSITE types
tmp_remove_IDs = list(tmp_df_decisions['datasetId 1'][np.isin(tmp_df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
tmp_remove_IDs += list(tmp_df_decisions['datasetId 2'][np.isin(tmp_df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
tmp_remove_IDs = np.unique(tmp_remove_IDs)
tmp_df_cleaned = tmp_df_dupfree.drop(tmp_remove_IDs) # df freed from 'REMOVE' type duplicates
# composite the records flagged COMPOSITE
tmp_comp_ID_pairs = tmp_df_decisions[(tmp_df_decisions['Decision 1']=='COMPOSITE')&(tmp_df_decisions['Decision 2']=='COMPOSITE')]
if len(tmp_comp_ID_pairs) > 0:
for _, pair in tmp_comp_ID_pairs.iterrows():
id1, id2 = pair['datasetId 1'], pair['datasetId 2']
# Log what was composited
composite_log.append({
'iteration': ii,
'composited': [id1, id2],
'new_id': f"{id1}_{id2}_composite" # placeholder; the actual composite ID is generated during compositing
})
# create new composite data and metadata from the pairs
# loop through the composite pairs and check metadata
tmp_df_composite = dup.join_composites_metadata(df_check, tmp_comp_ID_pairs, tmp_df_decisions, header)
tmp_df_dupfree = pd.concat([tmp_df_cleaned, tmp_df_composite])
print('--'*20)
print('Finished iteration.')
print('NEW DATAFRAME:')
print(tmp_df_dupfree.info())
print('--'*20)
print('--'*20)
if ii==9: print('STILL DUPLICATES PRESENT AFTER MULTIPLE ITERATIONS! REVISE DECISION PROCESS!!')
print('--'*20)
print(f"Created {len(composite_log)} composites across all iterations")
-------------------- ITERATION # 0 WARNING! Decisions associated with 54 multiple duplicates in the new dataframe. Please review these records below and run through a further duplicate detection workflow until no more duplicates are found. Check dataframe: <class 'pandas.core.frame.DataFrame'> RangeIndex: 54 entries, 0 to 53 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 54 non-null object 1 dataSetName 54 non-null object 2 geo_meanElev 53 non-null float32 3 geo_meanLat 54 non-null float32 4 geo_meanLon 54 non-null float32 5 geo_siteName 54 non-null object 6 interpretation_direction 54 non-null object 7 interpretation_seasonality 54 non-null object 8 interpretation_variable 54 non-null object 9 interpretation_variableDetail 54 non-null object 10 originalDataURL 54 non-null object 11 originalDatabase 54 non-null object 12 paleoData_notes 54 non-null object 13 paleoData_proxy 54 non-null object 14 paleoData_sensorSpecies 54 non-null object 15 paleoData_units 54 non-null object 16 paleoData_values 54 non-null object 17 paleoData_variableName 54 non-null object 18 year 54 non-null object 19 yearUnits 54 non-null object 20 duplicateDetails 54 non-null object 21 datasetId 54 non-null object dtypes: float32(3), object(19) memory usage: 8.8+ KB None tmp Start duplicate search: ================================= checking parameters: proxy archive : must match proxy type : must match distance (km) < 8 elevation : must match time overlap > 10 correlation > 0.9 RMSE < 0.1 1st difference rmse < 0.1 correlation of 1st difference > 0.9 ================================= Start duplicate search Progress: 0/54 Progress: 10/54 Progress: 20/54 Progress: 30/54 Progress: 40/54 Progress: 50/54 ============================================================ Saved indices, IDs, distances, correlations in data/tmp/dup_detection/ ============================================================ Detected 0 possible duplicates in tmp. 
============================================================ ============================================================ Indices: IDs: ============================================================ SUCCESS!! NO MORE DUPLICATES DETECTED!! Created 0 composites across all iterations
4. Create duplicate free dataframe¶
Do another duplicate search on the whole dataframe to make sure there are no duplicates present anymore.
tmp_df_dupfree.set_index('datasetId', inplace = True)
tmp_df_dupfree['datasetId']=tmp_df_dupfree.index
# Now we create a dataframe which needs to be checked for duplicates.
df_check = tmp_df_dupfree.copy()
df_check.name = 'tmp'
df_check.index = range(len(df_check))
# We then run a brief duplicate detection algorithm on the dataframe. Note that by default the composited data has the highest value in the hierarchy.
pot_dup_IDs = dup.find_duplicates_optimized(df_check, n_points_thresh=10, return_data=True)
if len(pot_dup_IDs)==0:
print('SUCCESS!! NO MORE DUPLICATES DETECTED!!')
else:
df_check = dup.define_hierarchy(df_check)
dup.duplicate_decisions_multiple(df_check, operator_details=operator_details, choose_recollection=True,
remove_identicals=False, backup=False)
# implement the decisions
tmp_df_decisions = pd.read_csv(f'data/{df_check.name}/dup_detection/dup_decisions_{df_check.name}_{initials}_{date}'+'.csv', header=5)
tmp_dup_details = dup.provide_dup_details(tmp_df_decisions, header)
# decisions
tmp_decisions = {}
for ind in tmp_df_decisions.index:
id1, id2 = tmp_df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
dec1, dec2 = tmp_df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
for id, dec in zip([id1, id2], [dec1, dec2]):
if id not in tmp_decisions: tmp_decisions[id] = []
tmp_decisions[id]+=[dec]
df_check.set_index('datasetId', inplace = True)
df_check['datasetId']=df_check.index
#drop all REMOVE or COMPOSITE types
tmp_remove_IDs = list(tmp_df_decisions['datasetId 1'][np.isin(tmp_df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
tmp_remove_IDs += list(tmp_df_decisions['datasetId 2'][np.isin(tmp_df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
tmp_remove_IDs = np.unique(tmp_remove_IDs)
tmp_df_cleaned = tmp_df_dupfree.drop(tmp_remove_IDs) # df freed from 'REMOVE' type duplicates
# composite the records flagged COMPOSITE
tmp_comp_ID_pairs = tmp_df_decisions[(tmp_df_decisions['Decision 1']=='COMPOSITE')&(tmp_df_decisions['Decision 2']=='COMPOSITE')]
# create new composite data and metadata from the pairs
# loop through the composite pairs and check metadata
tmp_df_composite = dup.join_composites_metadata(df_check, tmp_comp_ID_pairs, tmp_df_decisions, header)
tmp_df_dupfree = pd.concat([tmp_df_cleaned, tmp_df_composite])
print('Finished last round of duplicate removal.')
print('Potentially run through this cell again to check for remaining duplicates.')
tmp Start duplicate search: ================================= checking parameters: proxy archive : must match proxy type : must match distance (km) < 8 elevation : must match time overlap > 10 correlation > 0.9 RMSE < 0.1 1st difference rmse < 0.1 correlation of 1st difference > 0.9 ================================= Start duplicate search Progress: 0/4781 ... Progress: 3930/4781
/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide c /= stddev[:, None] /home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide c /= stddev[None, :]
Progress: 3940/4781 ... Progress: 4780/4781 ============================================================ Saved indices, IDs, distances, correlations in data/tmp/dup_detection/ ============================================================ Detected 0 possible duplicates in tmp. 
============================================================ ============================================================ Indices: IDs: ============================================================ SUCCESS!! NO MORE DUPLICATES DETECTED!!
df_dupfree = tmp_df_dupfree
print(df_dupfree.info())
<class 'pandas.core.frame.DataFrame'> Index: 4781 entries, pages2k_5 to dod2k_composite_z_FE23_europe_swed019w_FE23_europe_swed021w Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 4781 non-null object 1 dataSetName 4781 non-null object 2 geo_meanElev 4699 non-null float32 3 geo_meanLat 4781 non-null float32 4 geo_meanLon 4781 non-null float32 5 geo_siteName 4781 non-null object 6 interpretation_direction 4781 non-null object 7 interpretation_seasonality 4781 non-null object 8 interpretation_variable 4781 non-null object 9 interpretation_variableDetail 4781 non-null object 10 originalDataURL 4781 non-null object 11 originalDatabase 4781 non-null object 12 paleoData_notes 4781 non-null object 13 paleoData_proxy 4781 non-null object 14 paleoData_sensorSpecies 4780 non-null object 15 paleoData_units 4781 non-null object 16 paleoData_values 4781 non-null object 17 paleoData_variableName 4780 non-null object 18 year 4781 non-null object 19 yearUnits 4781 non-null object 20 duplicateDetails 4781 non-null object 21 datasetId 4781 non-null object dtypes: float32(3), object(19) memory usage: 803.1+ KB None
Save duplicate free dataframe¶
Sort the columns and assign a name to the dataframe, which is used for saving purposes (it determines the directory and filenames). Make sure that the date and operator initials are used in the name.
df_dupfree = df_dupfree[sorted(df_dupfree.columns)]
if df.name=='all_merged':
yn = input('Would you like to save as dod2k_v2.0? [y/n]')
if yn=='y':
df_dupfree.name = 'dod2k_v2.0'
else:
df_dupfree.name =f'{df.name}_{initials}_{date}_dupfree'
os.makedirs(f'data/{df_dupfree.name}/', exist_ok=True)
df_dupfree.info()
print(df_dupfree.name)
<class 'pandas.core.frame.DataFrame'> Index: 4781 entries, pages2k_5 to dod2k_composite_z_FE23_europe_swed019w_FE23_europe_swed021w Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 4781 non-null object 1 dataSetName 4781 non-null object 2 datasetId 4781 non-null object 3 duplicateDetails 4781 non-null object 4 geo_meanElev 4699 non-null float32 5 geo_meanLat 4781 non-null float32 6 geo_meanLon 4781 non-null float32 7 geo_siteName 4781 non-null object 8 interpretation_direction 4781 non-null object 9 interpretation_seasonality 4781 non-null object 10 interpretation_variable 4781 non-null object 11 interpretation_variableDetail 4781 non-null object 12 originalDataURL 4781 non-null object 13 originalDatabase 4781 non-null object 14 paleoData_notes 4781 non-null object 15 paleoData_proxy 4781 non-null object 16 paleoData_sensorSpecies 4780 non-null object 17 paleoData_units 4781 non-null object 18 paleoData_values 4781 non-null object 19 paleoData_variableName 4780 non-null object 20 year 4781 non-null object 21 yearUnits 4781 non-null object dtypes: float32(3), object(19) memory usage: 803.1+ KB dod2k_v2.0
save pickle¶
# save the duplicate free dataframe as a pickle
df_dupfree.to_pickle(f'data/{df_dupfree.name}/{df_dupfree.name}_compact.pkl')
save csv¶
# save to a list of csv files (metadata, data, year)
utf.write_compact_dataframe_to_csv(df_dupfree)
METADATA: datasetId, archiveType, dataSetName, duplicateDetails, geo_meanElev, geo_meanLat, geo_meanLon, geo_siteName, interpretation_direction, interpretation_seasonality, interpretation_variable, interpretation_variableDetail, originalDataURL, originalDatabase, paleoData_notes, paleoData_proxy, paleoData_sensorSpecies, paleoData_units, paleoData_variableName, yearUnits Saved to /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact_%s.csv
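The call above splits the compact dataframe into metadata, year, and paleoData_values CSV files. A rough sketch of what such a split might look like, assuming the ragged series are flattened to comma-separated strings keyed by datasetId (this is an assumption for illustration, not the actual implementation of utf.write_compact_dataframe_to_csv):

```python
import os
import pandas as pd

def write_compact_csvs(df, name, outdir='data'):
    """Sketch of a compact CSV export: metadata in one file, the ragged
    year/value series in two others, one row per record."""
    base = os.path.join(outdir, name, f'{name}_compact')
    # metadata: everything except the ragged series columns
    meta_cols = [c for c in df.columns if c not in ('year', 'paleoData_values')]
    df[meta_cols].to_csv(base + '_metadata.csv', index=False)
    # series columns: flatten each list to a comma-separated string
    for col in ('year', 'paleoData_values'):
        pd.DataFrame({
            'datasetId': df['datasetId'],
            col: [','.join(map(str, v)) for v in df[col]],
        }).to_csv(f'{base}_{col}.csv', index=False)
```

Keeping the series in separate files avoids a very wide or very sparse single CSV, since the records have different lengths and time axes.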
# load dataframe
print(utf.load_compact_dataframe_from_csv(df_dupfree.name).info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4781 entries, 0 to 4780 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 4781 non-null object 1 dataSetName 4781 non-null object 2 datasetId 4781 non-null object 3 duplicateDetails 4781 non-null object 4 geo_meanElev 4699 non-null float32 5 geo_meanLat 4781 non-null float32 6 geo_meanLon 4781 non-null float32 7 geo_siteName 4781 non-null object 8 interpretation_direction 4781 non-null object 9 interpretation_seasonality 4781 non-null object 10 interpretation_variable 4781 non-null object 11 interpretation_variableDetail 4781 non-null object 12 originalDataURL 4781 non-null object 13 originalDatabase 4781 non-null object 14 paleoData_notes 4781 non-null object 15 paleoData_proxy 4781 non-null object 16 paleoData_sensorSpecies 4781 non-null object 17 paleoData_units 4781 non-null object 18 paleoData_values 4781 non-null object 19 paleoData_variableName 4781 non-null object 20 year 4781 non-null object 21 yearUnits 4781 non-null object dtypes: float32(3), object(19) memory usage: 765.8+ KB None
# write header with operator information as README txt file
with open(f'data/{df_dupfree.name}/{df_dupfree.name}_dupfree_README.txt', 'w') as file:
    for line in header:
        file.write(line+'\n')
fn = utf.find(df_dupfree.name, f'data/{df_dupfree.name}')
print(fn)
if fn != []:
    print('----------------------------------------------------')
    print('Successfully finished the duplicate finalising process!'.upper())
    print('----------------------------------------------------')
    print('Saved the final output files in:')
    print()
    for ff in fn:
        print(' '+os.getcwd()+'/%s.'%ff)
    print()
    print('The duplicate detection process is now finished and the duplicate-free database is ready to use!')
else:
    print('!!!!!!!!!!!!WARNING!!!!!!!!!!!')
    print(f'Final output file is missing at data/{df_dupfree.name}.')
    print()
    print('Please re-run the notebook to complete the duplicate finalising process.')
['data/dod2k_v2.0/dod2k_v2.0_compact.pkl', 'data/dod2k_v2.0/dod2k_v2.0_compact_metadata.csv', 'data/dod2k_v2.0/dod2k_v2.0_compact_year.csv', 'data/dod2k_v2.0/dod2k_v2.0_compact_paleoData_values.csv', 'data/dod2k_v2.0/dod2k_v2.0_README.txt', 'data/dod2k_v2.0/dod2k_v2.0_dupfree_README.txt']
----------------------------------------------------
SUCCESSFULLY FINISHED THE DUPLICATE FINALISING PROCESS!
----------------------------------------------------
Saved the final output files in:

 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact.pkl.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact_metadata.csv.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact_year.csv.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact_paleoData_values.csv.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_README.txt.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_dupfree_README.txt.

The duplicate detection process is now finished and the duplicate-free database is ready to use!
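The sanity check above relies on `utf.find` returning a list of file paths matching the database name. Its assumed behaviour can be sketched with `os.walk` (a hypothetical re-implementation for illustration, not the actual `ut_functions` helper):

```python
import os
import tempfile

def find_files(pattern: str, path: str) -> list:
    """Return paths of all files below `path` whose names contain `pattern`.
    A minimal stand-in for the assumed behaviour of `utf.find`."""
    hits = []
    for root, _dirs, files in os.walk(path):
        hits += [os.path.join(root, name) for name in files if pattern in name]
    return sorted(hits)

# usage on a throwaway directory
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, 'dod2k_demo_compact.pkl'), 'w').close()
open(os.path.join(tmp, 'unrelated.txt'), 'w').close()
print(find_files('dod2k_demo', tmp))
```

An empty list from such a helper is what triggers the warning branch in the cell above.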
Summary and summary plots of datasets¶
Import plotting libraries
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec as GS
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from dod2k_utilities import ut_plot as uplt # contains plotting functions
#%% print some info about the data
db_types = df_cleaned['originalDatabase'].unique()
col = uplt.get_colours(range(len(db_types)), 'tab10', 0, len(db_types))
#col = ['tab:blue','tab:green', 'tab:grey', 'tab:pink', 'tab:orange']
counts = []
ticks = []
colours = []
for ii, db in enumerate(db_types):
    cc = df_cleaned['originalDatabase'][(df_cleaned['originalDatabase']==db)].count()
    counts += [cc]
    ticks += [db.split('(Ocn_103')[0]]
    colours += [col[ii]]
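The per-database tallies computed in the loop above can equivalently be obtained with pandas' `value_counts` in a single call (shown here on a toy frame, since `df_cleaned` is built earlier in the notebook; note that `value_counts` sorts by count by default, whereas the loop keeps the order of `unique()`):

```python
import pandas as pd

# toy stand-in for df_cleaned; the database labels are illustrative
df_toy = pd.DataFrame({'originalDatabase': ['PAGES2k', 'Iso2k', 'PAGES2k', 'CoralHydro2k']})

# tally records per database in one call instead of an explicit loop
counts = df_toy['originalDatabase'].value_counts()
print(counts)
```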
# plot a bar chart of the number of records per original database
fig = plt.figure(figsize=(8,4), dpi=200)
ax = plt.gca()
plt.bar(range(len(ticks)), counts, color=colours)
plt.xlabel('database')
plt.ylabel('count')
ax.set_xticks(range(len(ticks)), ticks, rotation=45, ha='right')
#ax.set_xticklabels(proxy_types, rotation=45, ha='right')
plt.title('original database')
fig.tight_layout()
plt.show()
utf.figsave(fig, 'SF_removed_recs_barchart_databases', add='%s/'%df_dupfree.name)
saved figure in /figs/dod2k_v2.0//SF_removed_recs_barchart_databases.pdf
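`utf.figsave` apparently writes the figure as a PDF under `figs/<database>/` and prints the saved path. A hypothetical sketch of such a helper (the function name, defaults, and directory layout are assumptions, not the real `ut_functions` code):

```python
import os
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this sketch runs headless
import matplotlib.pyplot as plt

def figsave_sketch(fig, name: str, add: str = '', figdir: str = 'figs') -> str:
    """Save `fig` as a PDF under figdir/add/ and return the path.
    A hypothetical stand-in for `utf.figsave`; the real helper may differ."""
    outdir = os.path.join(figdir, add)
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, name + '.pdf')
    fig.savefig(path, bbox_inches='tight')
    print('saved figure in', path)
    return path

fig_demo = plt.figure(figsize=(2, 2))
plt.plot([0, 1], [0, 1])
saved_path = figsave_sketch(fig_demo, 'demo_line', add='demo/')
```

Centralising the figure directory in one helper is what keeps all notebook figures collected under a per-database folder, as the printed paths above show.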
#%% print some info about the data
proxy_types = df_cleaned['paleoData_proxy'].unique()
archive_types = df_cleaned['archiveType'].unique()
print(proxy_types)
print(archive_types)
col = uplt.get_colours(range(0,len(archive_types)), 'Accent', -1, len(archive_types))
counts = []
ticks = []
colours = []
for ii, at in enumerate(archive_types):
    proxy_types = df_cleaned['paleoData_proxy'][df_cleaned['archiveType']==at].unique()
    for pt in proxy_types:
        cc = df_cleaned['paleoData_proxy'][(df_cleaned['paleoData_proxy']==pt)&(df_cleaned['archiveType']==at)].count()
        # print('%25s'%pt+': '+str(cc))
        counts += [cc]
        ticks += [at+': '+pt]
        colours += [col[ii]]
['ring width' 'ARSTAN' 'maximum latewood density' 'reflectance' 'd18O' 'd13C' 'Sr/Ca' 'Mg/Ca' 'temperature' 'historical' 'varve thickness' 'ice melt' 'alkenone' 'chironomid' 'Uk37' 'borehole' 'pollen' 'dinocyst' 'count' 'concentration' 'chrysophyte assemblage' 'calcification rate' 'foraminifera' 'dust' 'chloride' 'sulfate' 'nitrate' 'thickness' 'dD' 'TEX86' 'effective precipitation' 'diatom' 'multiproxy' 'humidification index' 'accumulation rate' 'sodium' 'growth rate']
['Wood' 'Coral' 'LakeSediment' 'MarineSediment' 'Documents' 'GlacierIce' 'Borehole' 'Sclerosponge' 'Speleothem' 'Other' 'GroundIce' 'MolluskShell']
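The nested archive/proxy counting loop above is equivalent to a single `groupby`, which yields one count per (archiveType, proxy) pair. A sketch on a toy frame (`df_cleaned` itself is defined earlier in the notebook):

```python
import pandas as pd

# toy stand-in for df_cleaned with a few archive/proxy combinations
df_toy = pd.DataFrame({
    'archiveType':     ['Wood', 'Wood', 'Coral', 'Coral', 'Coral'],
    'paleoData_proxy': ['ring width', 'd13C', 'd18O', 'd18O', 'Sr/Ca'],
})

# one count per (archiveType, proxy) pair, like the nested loop above
pair_counts = df_toy.groupby(['archiveType', 'paleoData_proxy']).size()
print(pair_counts)
```

The resulting Series carries a (archiveType, proxy) MultiIndex, so the bar labels used above (`at+': '+pt`) fall out of the index directly.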
# plot a bar chart of the number of proxy types included in the dataset
fig = plt.figure(figsize=(12, 6), dpi=150)
ax = plt.gca()
plt.bar(range(len(ticks)), counts, color=colours)
plt.xlabel('proxy type')
plt.ylabel('count')
ax.set_xticks(range(len(ticks)), ticks, rotation=45, ha='right')
#ax.set_xticklabels(proxy_types, rotation=45, ha='right')
ax.set_yscale('log')
plt.title('removed proxy types')
fig.tight_layout()
plt.show()
utf.figsave(fig, 'SF_removed_recs_barchart_proxytypes', add='%s/'%df_dupfree.name)
saved figure in /figs/dod2k_v2.0//SF_removed_recs_barchart_proxytypes.pdf
#%% plot the spatial distribution of the removed records
proxy_lats = df_cleaned['geo_meanLat'].values
proxy_lons = df_cleaned['geo_meanLon'].values
# plots the map
fig = plt.figure(figsize=(10, 5), dpi=200)
grid = GS(1, 3)
ax = plt.subplot(grid[:, -2:], projection=ccrs.Robinson()) # create axis with Robinson projection of globe
ax.stock_img()
ax.add_feature(cfeature.LAND) # adds land features
ax.coastlines() # adds coastline features
mt = 'ov^<>pP*XDd'*10 # string of marker symbols, repeated so there are enough for all proxy types
archive_marker = {aa: mm for aa, mm in zip(archive_types, mt)} # assign a marker type to each archive type
archive_colour = {aa: cc for aa, cc in zip(archive_types, col)} # assign a colour to each archive type
# loop through the data to generate a scatter plot of each data record:
# 1st loop: go through archive types individually (determines marker type)
# 2nd loop: through paleo proxy types attributed to the specific archive, which is colour coded
for jj, at in enumerate(archive_types):
    arch_mask = df_cleaned['archiveType']==at
    arch_proxy_types = np.unique(df_cleaned['paleoData_proxy'][arch_mask])
    for ii, pt in enumerate(arch_proxy_types):
        pt_mask = df_cleaned['paleoData_proxy']==pt
        plt.scatter(proxy_lons[pt_mask&arch_mask], proxy_lats[pt_mask&arch_mask],
                    transform=ccrs.PlateCarree(), zorder=999,
                    marker=mt[ii], color=archive_colour[at],
                    label=at+': '+pt+' ($n=%d$)'%df_cleaned['paleoData_proxy'][pt_mask&arch_mask].count(),
                    lw=.5, ec='k')
plt.title('removed proxy types')
plt.legend(bbox_to_anchor=(0.03,1.1), ncol=2, fontsize=9, framealpha=0)
grid.tight_layout(fig)
utf.figsave(fig, 'SF_removed_spatial', add='%s/'%df_dupfree.name)
/tmp/ipykernel_1169431/3409790064.py:36: UserWarning: Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations
  grid.tight_layout(fig)
saved figure in /figs/dod2k_v2.0//SF_removed_spatial.pdf