Duplicate detection - step 3: remove true duplicates¶
This notebook runs the third part of the duplicate detection algorithm on a dataframe with the following columns:
- archiveType (used for duplicate detection algorithm)
- dataSetName
- datasetId
- geo_meanElev (used for duplicate detection algorithm)
- geo_meanLat (used for duplicate detection algorithm)
- geo_meanLon (used for duplicate detection algorithm)
- geo_siteName (used for duplicate detection algorithm)
- interpretation_direction
- interpretation_seasonality
- interpretation_variable
- interpretation_variableDetails
- originalDataURL
- originalDatabase
- paleoData_notes
- paleoData_proxy (used for duplicate detection algorithm)
- paleoData_units
- paleoData_values (used for duplicate detection algorithm: test for correlation, RMSE, correlation of 1st difference, RMSE of 1st difference)
- paleoData_variableName
- year (used for duplicate detection algorithm)
- yearUnits

This interactive notebook (dup_removal.ipynb) removes the duplicates flagged in dup_detection.ipynb, following the decisions made in dup_decision.ipynb. The decisions include:
- removal of redundant duplicates
- creation of composites
These decisions are based on the operator choices recorded in data/DATABASENAME/duplicate_detection/duplicate_decisions_DATABASENAME_AUTHORINITIALS_YY-MM-DD.csv.
Ultimately, a duplicate-free dataframe is saved under:
- data/DATABASENAME/DATABASENAME_dupfree.pkl
- data/DATABASENAME/DATABASENAME_dupfree_data.csv
- data/DATABASENAME/DATABASENAME_dupfree_year.csv
- data/DATABASENAME/DATABASENAME_dupfree_metadata.csv
10/11/2025 by LL: tidied up with revised data organisation and prepared for documentation
02/12/2024 by LL: modified the compositing process for metadata to fix bugs and make it more user friendly; added some extra information to the bottom of the file (prior to the figures)
22/10/2024 by LL: added the composite option for duplicates (create z-scores and average over the shared time period)
30/09/2024 by LL: keep all original database values for removed duplicates with more than one original database
Author: Lucie Luecke, created 27/9/2024
Initialisation¶
Set up working environment¶
Make sure the repo_root is set correctly; it should be your_root_dir/dod2k. This should be the working directory throughout this notebook (and all other notebooks).
%load_ext autoreload
%autoreload 2
import sys
import os
from pathlib import Path
# Add parent directory to path (works from any notebook in notebooks/)
# the repo_root should be the parent directory of the notebooks folder
current_dir = Path().resolve()
# Determine repo root
if current_dir.name == 'dod2k':
    repo_root = current_dir
elif current_dir.parent.name == 'dod2k':
    repo_root = current_dir.parent
else:
    raise Exception('Please review the repo root structure (see first cell).')
# Update cwd and path only if needed
if os.getcwd() != str(repo_root):
os.chdir(repo_root)
if str(repo_root) not in sys.path:
sys.path.insert(0, str(repo_root))
print(f"Repo root: {repo_root}")
if os.getcwd() == str(repo_root):
    print("Working directory matches repo root.")
Repo root: /home/jupyter-lluecke/dod2k Working directory matches repo root.
import pandas as pd
import numpy as np
import datetime
from dod2k_utilities import ut_functions as utf # contains utility functions
from dod2k_utilities import ut_duplicate_search as dup # contains utility functions
Load dataset¶
Define the dataset which needs to be screened for duplicates. Input files for the duplicate detection mechanism need to be compact dataframes (pandas dataframes with standardised columns and entry formatting).
The function load_compact_dataframe_from_csv loads the dataframe from a csv file in data/DB/, with DB the name of the database. The database name (db_name) can be one of
- pages2k
- ch2k
- iso2k
- sisal
- fe23

for the individual databases, or
- all_merged

to load the merged database of all individual databases; alternatively, it can be any user-defined compact dataframe.
# load dataframe
db_name='all_merged'
# db_name='dup_test'
df = utf.load_compact_dataframe_from_csv(db_name)
print(df.info())
df.name = db_name
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5147 entries, 0 to 5146 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 5147 non-null object 1 dataSetName 5147 non-null object 2 datasetId 5147 non-null object 3 geo_meanElev 5048 non-null float32 4 geo_meanLat 5147 non-null float32 5 geo_meanLon 5147 non-null float32 6 geo_siteName 5147 non-null object 7 interpretation_direction 5147 non-null object 8 interpretation_seasonality 5147 non-null object 9 interpretation_variable 5147 non-null object 10 interpretation_variableDetail 5147 non-null object 11 originalDataURL 5147 non-null object 12 originalDatabase 5147 non-null object 13 paleoData_notes 5147 non-null object 14 paleoData_proxy 5147 non-null object 15 paleoData_sensorSpecies 5147 non-null object 16 paleoData_units 5147 non-null object 17 paleoData_values 5147 non-null object 18 paleoData_variableName 5147 non-null object 19 year 5147 non-null object 20 yearUnits 5147 non-null object dtypes: float32(3), object(18) memory usage: 784.2+ KB None
Set datasetId as dataframe index to reliably identify the duplicates:
df.set_index('datasetId', inplace = True)
df['datasetId']=df.index
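For illustration, a toy example of this indexing pattern (the records here are hypothetical, not part of the database):

```python
import pandas as pd

# Toy compact dataframe with hypothetical records
df_toy = pd.DataFrame({
    'datasetId': ['pages2k_0', 'iso2k_296'],
    'geo_siteName': ['Site A', 'Site B'],
})

# Use datasetId as the index, but keep it available as a column too
df_toy.set_index('datasetId', inplace=True)
df_toy['datasetId'] = df_toy.index

# Records can now be selected reliably by ID rather than by position
print(df_toy.loc['iso2k_296', 'geo_siteName'])  # -> Site B
```

Keeping datasetId both as index and as a column means ID-based lookups work while the ID still survives in column-wise exports.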
Input operator's credentials¶
To ensure maximum transparency and reproducibility, enter the operator's credentials here.
These details are used to flag the intermediate output files and are provided along with the final duplicate-free dataset.
# initials = 'LL'
# fullname = 'Lucie Luecke'
# email = 'ljluec1@st-andrews.ac.uk'
initials = 'MNE'
fullname = 'Michael Evans'
email = 'mnevans@umd.edu'
operator_details = [initials, fullname, email]
Apply duplicate decisions to dataframe¶
Load duplicate decisions from csv¶
Please specify the date of the decision process below. The decision output file is then loaded from data/DBNAME/dup_detection/dup_decisions_DBNAME_INITIALS_DATE.csv.
# date = str(datetime.datetime.utcnow())[2:10]
# date='25-12-11' # for Lucie's dup_test decisions
date='25-12-19' # for Mike's latest all_merged decisions
# date='25-12-11' # for Lucie's latest all_merged decisions
filename = f'data/{df.name}/dup_detection/dup_decisions_{df.name}_{initials}_{date}'
data, header = dup.read_csv(filename, header=True)
df_decisions = pd.read_csv(filename+'.csv', header=5)
for hh in header:
print(hh)
print(df_decisions.columns)
print(df.name)
Decisions for duplicate candidate pairs.
Operated by Michael Evans (MNE)
E-Mail: mnevans@umd.edu
Created on: 2025-12-19 23:14:43.155509 (UTC)
updated for PAGES2k v2.2.0, Iso2k v1.1.0, CoralHydro2k v1.0.1 all from https://lipdverse.org/ . Checked: Cobb et al (2003, 2013); Dee et al (2020) composite record for Palmyra is updated (https://lipdverse.org/data/XHtmwaqI1qgs2CQxrXgK/1_0_8//CO03COPM.jsonld) in Pages2k v2.2.0 lipdverse compilation.
Index(['index 1', 'index 2', 'figure path', 'datasetId 1', 'datasetId 2',
'originalDatabase 1', 'originalDatabase 2', 'geo_siteName 1',
'geo_siteName 2', 'geo_meanLat 1', 'geo_meanLat 2', 'geo_meanLon 1',
'geo_meanLon 2', 'geo_meanElevation 1', 'geo_meanElevation 2',
'archiveType 1', 'archiveType 2', 'paleoData_proxy 1',
'paleoData_proxy 2', 'originalDataURL 1', 'originalDataURL 2', 'year 1',
'year 2', 'Decision 1', 'Decision 2', 'Decision type',
'Decision comment'],
dtype='object')
all_merged
Collect the decisions associated with each candidate ID (not as a pair) to identify records associated with multiple decisions.
# Collect decisions for each record
decisions = dup.collect_record_decisions(df_decisions)
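The grouping performed here can be illustrated with a toy decision table. This is only a sketch with hypothetical IDs, not the actual implementation in ut_duplicate_search:

```python
import pandas as pd
from collections import defaultdict

def collect_decisions_sketch(df_decisions):
    """Collect the decisions attached to each record ID across all pairs."""
    decisions = defaultdict(list)
    for _, row in df_decisions.iterrows():
        decisions[row['datasetId 1']].append(row['Decision 1'])
        decisions[row['datasetId 2']].append(row['Decision 2'])
    return dict(decisions)

# Toy decision table in which pages2k_0 appears in two candidate pairs
toy = pd.DataFrame({
    'datasetId 1': ['pages2k_0', 'pages2k_0'],
    'Decision 1':  ['REMOVE', 'REMOVE'],
    'datasetId 2': ['iso2k_296', 'ch2k_AB01XYZ01_0'],
    'Decision 2':  ['KEEP', 'KEEP'],
})
print(collect_decisions_sketch(toy))
# pages2k_0 collects one decision per pair it appears in: ['REMOVE', 'REMOVE']
```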
Show the decisions associated with each individual record:
for rr, dd in decisions.items():
print(rr, dd)
pages2k_0 ['REMOVE', 'REMOVE', 'REMOVE'] iso2k_296 ['KEEP', 'KEEP', 'KEEP'] iso2k_298 ['KEEP', 'REMOVE', 'KEEP'] iso2k_299 ['KEEP', 'REMOVE', 'REMOVE'] pages2k_6 ['REMOVE'] FE23_northamerica_usa_az555 ['KEEP'] pages2k_50 ['REMOVE'] FE23_northamerica_canada_cana091 ['KEEP'] pages2k_62 ['KEEP'] pages2k_63 ['REMOVE'] pages2k_81 ['REMOVE', 'REMOVE'] ch2k_HE08LRA01_76 ['KEEP', 'REMOVE'] iso2k_1813 ['KEEP', 'KEEP'] pages2k_83 ['REMOVE'] iso2k_1916 ['KEEP'] pages2k_85 ['KEEP'] pages2k_88 ['REMOVE'] pages2k_94 ['REMOVE'] FE23_northamerica_canada_cana153 ['KEEP'] pages2k_107 ['REMOVE'] FE23_northamerica_usa_ak046 ['KEEP'] pages2k_121 ['KEEP'] pages2k_122 ['REMOVE'] pages2k_132 ['REMOVE'] FE23_northamerica_canada_cana225 ['KEEP'] pages2k_158 ['REMOVE'] FE23_northamerica_usa_wa069 ['KEEP'] pages2k_171 ['REMOVE'] FE23_northamerica_usa_wy021 ['KEEP'] pages2k_203 ['REMOVE'] iso2k_826 ['KEEP'] pages2k_225 ['REMOVE'] FE23_northamerica_usa_nv512 ['KEEP', 'REMOVE'] pages2k_238 ['REMOVE'] iso2k_1044 ['KEEP'] pages2k_242 ['REMOVE', 'REMOVE'] ch2k_LI06FIJ01_582 ['KEEP', 'REMOVE'] iso2k_353 ['KEEP', 'KEEP'] pages2k_258 ['REMOVE'] iso2k_1498 ['KEEP'] pages2k_263 ['REMOVE'] iso2k_1322 ['KEEP'] pages2k_267 ['KEEP', 'REMOVE'] iso2k_58 ['REMOVE', 'REMOVE'] iso2k_1068 ['KEEP', 'REMOVE'] pages2k_271 ['REMOVE', 'REMOVE'] ch2k_FE18RUS01_492 ['KEEP', 'REMOVE'] iso2k_1861 ['KEEP', 'KEEP'] pages2k_273 ['REMOVE'] FE23_asia_russ130w ['KEEP'] pages2k_281 ['REMOVE'] FE23_northamerica_canada_cana155 ['KEEP'] pages2k_294 ['REMOVE'] FE23_northamerica_usa_ak021 ['KEEP'] pages2k_305 ['REMOVE'] pages2k_309 ['KEEP'] pages2k_307 ['REMOVE'] pages2k_311 ['KEEP'] pages2k_315 ['REMOVE'] iso2k_362 ['KEEP'] pages2k_317 ['REMOVE', 'REMOVE'] ch2k_NA09MAL01_84 ['KEEP', 'REMOVE'] iso2k_1754 ['KEEP', 'KEEP'] pages2k_323 ['REMOVE'] FE23_northamerica_canada_cana210 ['KEEP'] pages2k_385 ['REMOVE', 'REMOVE'] ch2k_FE09OGA01_304 ['KEEP', 'REMOVE'] iso2k_1922 ['KEEP', 'KEEP'] pages2k_387 ['REMOVE'] ch2k_FE09OGA01_306 ['KEEP'] 
pages2k_395 ['REMOVE', 'REMOVE'] ch2k_CA07FLI01_400 ['KEEP', 'REMOVE'] iso2k_1057 ['KEEP', 'KEEP'] pages2k_397 ['REMOVE'] ch2k_CA07FLI01_402 ['KEEP'] pages2k_409 ['REMOVE', 'REMOVE'] ch2k_QU96ESV01_422 ['KEEP', 'REMOVE'] iso2k_218 ['KEEP', 'KEEP'] pages2k_414 ['KEEP'] pages2k_418 ['REMOVE'] pages2k_417 ['KEEP'] pages2k_421 ['REMOVE'] pages2k_427 ['KEEP'] pages2k_433 ['REMOVE'] pages2k_435 ['KEEP'] pages2k_842 ['REMOVE'] pages2k_444 ['KEEP', 'KEEP'] pages2k_445 ['REMOVE', 'KEEP'] pages2k_446 ['REMOVE', 'REMOVE'] pages2k_462 ['REMOVE', 'REMOVE'] ch2k_OS14UCP01_236 ['KEEP', 'REMOVE'] iso2k_350 ['KEEP', 'KEEP'] pages2k_468 ['KEEP', 'REMOVE'] pages2k_3550 ['REMOVE', 'REMOVE'] FE23_asia_russ137w ['KEEP', 'KEEP'] pages2k_472 ['KEEP', 'KEEP'] pages2k_474 ['REMOVE', 'REMOVE'] pages2k_477 ['REMOVE', 'REMOVE'] pages2k_478 ['REMOVE'] iso2k_1846 ['KEEP'] pages2k_486 ['REMOVE'] FE23_northamerica_usa_ca609 ['KEEP'] pages2k_495 ['REMOVE', 'REMOVE'] ch2k_LI06RAR01_12 ['KEEP', 'REMOVE'] iso2k_1502 ['KEEP', 'KEEP'] pages2k_500 ['REMOVE', 'REMOVE'] ch2k_AS05GUA01_302 ['KEEP', 'REMOVE'] iso2k_1559 ['KEEP', 'KEEP'] pages2k_541 ['REMOVE'] iso2k_404 ['KEEP'] pages2k_543 ['KEEP'] pages2k_976 ['REMOVE'] pages2k_565 ['REMOVE'] iso2k_998 ['KEEP'] pages2k_583 ['REMOVE'] FE23_northamerica_usa_mt116 ['KEEP'] pages2k_592 ['REMOVE', 'REMOVE'] ch2k_LI06RAR02_270 ['KEEP', 'REMOVE'] iso2k_1500 ['KEEP', 'KEEP'] pages2k_610 ['REMOVE'] iso2k_1199 ['KEEP'] pages2k_626 ['REMOVE'] FE23_northamerica_usa_wa071 ['KEEP'] pages2k_691 ['REMOVE'] FE23_northamerica_canada_cana062 ['KEEP'] pages2k_730 ['REMOVE'] iso2k_396 ['KEEP'] pages2k_736 ['REMOVE'] FE23_northamerica_usa_wy024 ['KEEP'] pages2k_800 ['REMOVE'] FE23_northamerica_canada_cana234 ['KEEP'] pages2k_818 ['REMOVE'] iso2k_488 ['KEEP'] pages2k_827 ['KEEP'] pages2k_830 ['REMOVE'] pages2k_831 ['KEEP', 'REMOVE'] pages2k_2220 ['REMOVE', 'REMOVE'] FE23_asia_russ127w ['KEEP', 'KEEP'] pages2k_857 ['REMOVE'] FE23_northamerica_usa_ut511 ['KEEP'] pages2k_881 
['REMOVE'] iso2k_1010 ['KEEP', 'KEEP'] pages2k_893 ['KEEP', 'KEEP'] pages2k_895 ['REMOVE', 'KEEP'] pages2k_900 ['REMOVE', 'REMOVE'] pages2k_940 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_DR99ABR01_264 ['KEEP', 'KEEP', 'REMOVE'] ch2k_DR99ABR01_266 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_91 ['KEEP', 'KEEP', 'KEEP'] pages2k_945 ['REMOVE'] iso2k_100 ['KEEP'] pages2k_960 ['REMOVE'] iso2k_641 ['KEEP'] pages2k_982 ['REMOVE'] FE23_northamerica_usa_or042 ['KEEP'] pages2k_1004 ['REMOVE'] iso2k_644 ['KEEP'] pages2k_1026 ['REMOVE'] FE23_northamerica_usa_az553 ['KEEP'] pages2k_1048 ['REMOVE'] iso2k_1212 ['KEEP'] pages2k_1089 ['REMOVE', 'REMOVE'] FE23_northamerica_usa_mt112 ['KEEP', 'KEEP'] FE23_northamerica_usa_mt113 ['KEEP', 'REMOVE'] pages2k_1108 ['REMOVE'] iso2k_1060 ['KEEP'] pages2k_1116 ['REMOVE'] FE23_northamerica_canada_cana170w ['KEEP'] pages2k_1147 ['KEEP', 'KEEP', 'REMOVE'] ch2k_DA06MAF01_78 ['REMOVE', 'REMOVE'] ch2k_DA06MAF02_104 ['REMOVE', 'REMOVE'] iso2k_1748 ['KEEP', 'KEEP', 'KEEP'] pages2k_1153 ['KEEP', 'KEEP'] pages2k_1156 ['REMOVE', 'KEEP'] pages2k_1160 ['REMOVE', 'REMOVE'] pages2k_1209 ['REMOVE'] FE23_northamerica_usa_co553 ['KEEP'] pages2k_1252 ['REMOVE'] FE23_northamerica_canada_cana096 ['KEEP'] pages2k_1274 ['REMOVE'] iso2k_1577 ['KEEP'] pages2k_1293 ['REMOVE'] iso2k_821 ['KEEP'] pages2k_1325 ['REMOVE'] FE23_northamerica_usa_wy030 ['KEEP'] pages2k_1360 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_UR00MAI01_22 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_94 ['KEEP', 'KEEP', 'KEEP'] iso2k_98 ['KEEP', 'KEEP', 'REMOVE'] pages2k_1362 ['KEEP'] pages2k_1365 ['REMOVE'] pages2k_1370 ['REMOVE'] iso2k_1619 ['KEEP'] pages2k_1420 ['REMOVE'] FE23_northamerica_canada_cana111 ['KEEP'] pages2k_1442 ['KEEP'] pages2k_1444 ['REMOVE'] pages2k_1488 ['REMOVE', 'REMOVE', 'REMOVE', 'REMOVE'] pages2k_1628 ['KEEP', 'KEEP', 'KEEP', 'KEEP'] ch2k_NU11PAL01_52 ['KEEP', 'REMOVE', 'REMOVE', 'REMOVE'] iso2k_505 ['REMOVE', 'REMOVE', 'REMOVE', 'REMOVE'] iso2k_579 ['KEEP', 'REMOVE', 'KEEP', 'REMOVE'] pages2k_1490 ['REMOVE'] 
ch2k_NU11PAL01_54 ['KEEP'] pages2k_1491 ['REMOVE'] iso2k_575 ['KEEP'] pages2k_1497 ['REMOVE'] iso2k_1885 ['KEEP'] pages2k_1515 ['KEEP'] pages2k_1519 ['REMOVE'] pages2k_1520 ['REMOVE'] pages2k_1522 ['KEEP'] pages2k_1547 ['REMOVE'] iso2k_259 ['KEEP'] pages2k_1566 ['REMOVE'] FE23_northamerica_canada_cana231 ['KEEP'] pages2k_1605 ['REMOVE'] FE23_northamerica_usa_ca606 ['KEEP'] pages2k_1619 ['KEEP'] pages2k_1623 ['REMOVE'] pages2k_1636 ['REMOVE'] FE23_northamerica_usa_wa081 ['KEEP'] pages2k_1686 ['KEEP'] pages2k_1688 ['REMOVE'] pages2k_1692 ['REMOVE'] FE23_asia_mong012 ['KEEP'] pages2k_1703 ['REMOVE', 'REMOVE'] ch2k_MO06PED01_226 ['KEEP', 'REMOVE'] iso2k_629 ['KEEP', 'KEEP'] pages2k_1712 ['REMOVE'] iso2k_715 ['KEEP'] pages2k_1720 ['REMOVE'] iso2k_1579 ['KEEP'] pages2k_1741 ['REMOVE'] FE23_northamerica_usa_wa104 ['KEEP'] pages2k_1750 ['REMOVE', 'REMOVE'] iso2k_1856 ['KEEP', 'KEEP'] sisal_294.0_194 ['KEEP', 'REMOVE'] pages2k_1771 ['REMOVE'] ch2k_TU01LAI01_192 ['KEEP'] pages2k_1804 ['REMOVE'] FE23_northamerica_usa_me010 ['KEEP'] pages2k_1859 ['REMOVE', 'REMOVE'] ch2k_HE10GUA01_244 ['KEEP', 'REMOVE'] iso2k_1735 ['KEEP', 'KEEP'] pages2k_1861 ['REMOVE'] ch2k_HE10GUA01_246 ['KEEP'] pages2k_1880 ['REMOVE'] FE23_northamerica_usa_ak060 ['KEEP'] pages2k_1891 ['REMOVE'] pages2k_1893 ['KEEP'] pages2k_1918 ['REMOVE'] iso2k_102 ['KEEP'] pages2k_1920 ['KEEP'] pages2k_1923 ['REMOVE'] pages2k_1932 ['REMOVE'] pages2k_1934 ['KEEP'] pages2k_1942 ['REMOVE', 'REMOVE'] ch2k_ZI04IFR01_26 ['KEEP', 'REMOVE'] iso2k_257 ['KEEP', 'KEEP'] pages2k_1972 ['KEEP'] pages2k_1973 ['REMOVE'] pages2k_1976 ['KEEP'] pages2k_1980 ['REMOVE'] pages2k_1978 ['KEEP'] pages2k_1983 ['REMOVE'] pages2k_1985 ['REMOVE'] iso2k_1294 ['KEEP'] pages2k_1989 ['KEEP'] pages2k_1991 ['REMOVE'] pages2k_1994 ['REMOVE'] ch2k_DE12ANC01_258 ['KEEP'] pages2k_2013 ['REMOVE'] FE23_northamerica_canada_cana097 ['KEEP'] pages2k_2042 ['REMOVE', 'REMOVE'] ch2k_TU95MAD01_24 ['KEEP', 'REMOVE'] iso2k_20 ['KEEP', 'KEEP'] pages2k_2059 ['REMOVE'] 
FE23_northamerica_usa_ak058 ['KEEP'] pages2k_2085 ['REMOVE'] FE23_northamerica_canada_cana002 ['KEEP'] pages2k_2094 ['REMOVE', 'REMOVE'] ch2k_TU01DEP01_450 ['KEEP', 'REMOVE'] iso2k_1201 ['KEEP', 'KEEP'] pages2k_2098 ['KEEP'] pages2k_2103 ['REMOVE'] pages2k_2110 ['REMOVE'] FE23_northamerica_usa_co554 ['KEEP'] pages2k_2146 ['KEEP', 'KEEP'] pages2k_2149 ['REMOVE', 'KEEP'] pages2k_2150 ['REMOVE', 'REMOVE'] pages2k_2156 ['REMOVE'] FE23_northamerica_canada_cana169w ['KEEP'] pages2k_2214 ['REMOVE'] iso2k_1631 ['KEEP'] pages2k_2226 ['REMOVE'] FE23_asia_mong007w ['KEEP'] pages2k_2265 ['REMOVE'] FE23_northamerica_usa_ak070 ['KEEP'] pages2k_2287 ['KEEP'] pages2k_2290 ['REMOVE'] pages2k_2300 ['REMOVE'] ch2k_OS14RIP01_174 ['KEEP'] pages2k_2303 ['REMOVE'] FE23_asia_mong006 ['KEEP'] pages2k_2309 ['REMOVE'] ch2k_WE09ARR01_208 ['KEEP'] pages2k_2311 ['REMOVE'] ch2k_WE09ARR01_210 ['KEEP'] pages2k_2319 ['REMOVE'] FE23_northamerica_usa_ak6 ['KEEP'] pages2k_2339 ['KEEP'] pages2k_2344 ['REMOVE'] pages2k_2361 ['REMOVE'] FE23_northamerica_usa_wa097 ['KEEP'] pages2k_2402 ['REMOVE'] FE23_northamerica_usa_co586 ['KEEP'] pages2k_2430 ['REMOVE'] FE23_northamerica_canada_cana113 ['KEEP'] pages2k_2473 ['REMOVE'] FE23_northamerica_usa_wy022 ['KEEP'] pages2k_2500 ['KEEP'] pages2k_2502 ['REMOVE'] pages2k_2510 ['REMOVE'] iso2k_1626 ['KEEP'] pages2k_2514 ['REMOVE'] iso2k_1467 ['KEEP'] pages2k_2517 ['REMOVE'] iso2k_1130 ['KEEP'] pages2k_2534 ['REMOVE'] iso2k_1575 ['KEEP'] pages2k_2538 ['REMOVE'] iso2k_1862 ['KEEP'] pages2k_2561 ['REMOVE'] FE23_northamerica_canada_cana094 ['KEEP'] pages2k_2592 ['KEEP'] pages2k_2596 ['REMOVE'] pages2k_2595 ['REMOVE'] pages2k_2599 ['REMOVE'] pages2k_2604 ['KEEP', 'REMOVE'] pages2k_2606 ['REMOVE', 'REMOVE'] iso2k_1481 ['KEEP', 'KEEP'] pages2k_2607 ['KEEP', 'KEEP'] pages2k_2609 ['REMOVE', 'KEEP'] pages2k_2612 ['REMOVE', 'REMOVE'] pages2k_2613 ['REMOVE'] iso2k_1470 ['KEEP'] pages2k_2617 ['REMOVE'] iso2k_1573 ['KEEP'] pages2k_2634 ['REMOVE'] FE23_northamerica_usa_id013 
['KEEP'] pages2k_2660 ['REMOVE'] FE23_northamerica_usa_ak014 ['KEEP'] pages2k_2677 ['REMOVE'] FE23_northamerica_usa_wy023 ['KEEP'] pages2k_2703 ['REMOVE'] FE23_northamerica_usa_ak094 ['KEEP'] pages2k_2722 ['REMOVE'] FE23_northamerica_canada_cana238 ['KEEP'] pages2k_2750 ['REMOVE'] iso2k_1708 ['KEEP'] pages2k_2752 ['KEEP', 'KEEP'] pages2k_2755 ['REMOVE', 'KEEP'] pages2k_2759 ['REMOVE', 'REMOVE'] pages2k_2793 ['KEEP'] pages2k_2795 ['REMOVE', 'KEEP'] pages2k_2798 ['REMOVE', 'REMOVE'] pages2k_2796 ['KEEP'] pages2k_2830 ['REMOVE'] FE23_northamerica_mexico_mexi020 ['KEEP'] pages2k_2843 ['REMOVE'] FE23_northamerica_usa_wa083 ['KEEP'] pages2k_2899 ['KEEP'] pages2k_2901 ['REMOVE'] pages2k_2904 ['KEEP'] pages2k_2906 ['REMOVE'] pages2k_2922 ['REMOVE'] FE23_northamerica_usa_ca603 ['KEEP'] pages2k_2953 ['REMOVE'] iso2k_573 ['KEEP'] pages2k_2959 ['REMOVE'] FE23_northamerica_mexico_mexi043 ['KEEP'] pages2k_2976 ['REMOVE'] FE23_northamerica_usa_id008 ['KEEP'] pages2k_3002 ['REMOVE'] FE23_northamerica_usa_or043 ['KEEP'] pages2k_3028 ['KEEP', 'KEEP'] pages2k_3030 ['REMOVE', 'KEEP'] pages2k_3033 ['REMOVE', 'REMOVE'] pages2k_3038 ['REMOVE'] FE23_northamerica_usa_mt108 ['KEEP'] pages2k_3064 ['REMOVE'] iso2k_698 ['KEEP'] pages2k_3068 ['REMOVE', 'REMOVE'] ch2k_ZI14IFR02_522 ['KEEP', 'KEEP'] ch2k_ZI14IFR02_524 ['KEEP', 'REMOVE'] pages2k_3085 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_KU00NIN01_150 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_1554 ['KEEP', 'KEEP', 'REMOVE'] iso2k_1556 ['KEEP', 'KEEP', 'KEEP'] pages2k_3107 ['REMOVE'] FE23_northamerica_usa_co552 ['KEEP', 'KEEP'] pages2k_3108 ['REMOVE'] pages2k_3132 ['REMOVE', 'REMOVE'] ch2k_QU06RAB01_144 ['KEEP', 'REMOVE'] iso2k_1311 ['KEEP', 'KEEP'] pages2k_3134 ['REMOVE'] ch2k_QU06RAB01_146 ['KEEP'] pages2k_3170 ['REMOVE'] FE23_australia_newz062 ['KEEP', 'KEEP'] pages2k_3179 ['REMOVE'] FE23_northamerica_usa_ak057 ['KEEP'] pages2k_3188 ['KEEP'] pages2k_3191 ['REMOVE'] pages2k_3196 ['REMOVE'] FE23_asia_mong011 ['KEEP'] pages2k_3202 ['REMOVE'] iso2k_1727 
['KEEP'] pages2k_3234 ['KEEP', 'KEEP'] pages2k_3236 ['REMOVE', 'KEEP'] pages2k_3239 ['REMOVE', 'REMOVE'] pages2k_3243 ['REMOVE'] iso2k_0 ['KEEP'] pages2k_3263 ['REMOVE'] iso2k_1264 ['KEEP'] pages2k_3266 ['REMOVE', 'REMOVE'] ch2k_GO12SBV01_396 ['KEEP', 'REMOVE'] iso2k_870 ['KEEP', 'KEEP'] pages2k_3307 ['REMOVE'] iso2k_339 ['KEEP'] pages2k_3313 ['REMOVE'] FE23_northamerica_usa_ca560 ['KEEP'] pages2k_3337 ['KEEP'] pages2k_3342 ['REMOVE'] pages2k_3352 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_ZI14TUR01_480 ['KEEP', 'KEEP', 'REMOVE'] ch2k_ZI14TUR01_482 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_302 ['KEEP', 'KEEP', 'KEEP'] pages2k_3372 ['REMOVE', 'REMOVE'] ch2k_KI04MCV01_366 ['KEEP', 'REMOVE'] iso2k_155 ['KEEP', 'KEEP'] pages2k_3374 ['REMOVE'] ch2k_KI04MCV01_368 ['KEEP'] pages2k_3404 ['REMOVE'] FE23_northamerica_canada_cana029 ['KEEP'] pages2k_3417 ['KEEP'] pages2k_3419 ['REMOVE'] pages2k_3503 ['REMOVE'] FE23_northamerica_usa_wa072 ['KEEP'] pages2k_3524 ['REMOVE'] FE23_northamerica_usa_ak010 ['KEEP'] pages2k_3552 ['REMOVE'] iso2k_1581 ['KEEP'] pages2k_3554 ['REMOVE', 'REMOVE'] ch2k_LI94SEC01_436 ['KEEP', 'REMOVE'] iso2k_1124 ['KEEP', 'KEEP'] pages2k_3571 ['REMOVE'] iso2k_174 ['KEEP'] pages2k_3583 ['REMOVE'] FE23_northamerica_usa_co633 ['KEEP'] pages2k_3599 ['REMOVE', 'KEEP'] iso2k_1069 ['KEEP', 'REMOVE'] iso2k_1660 ['KEEP', 'KEEP'] pages2k_3609 ['REMOVE'] FE23_northamerica_canada_cana053 ['KEEP'] pages2k_3631 ['REMOVE'] iso2k_1530 ['KEEP'] pages2k_3642 ['REMOVE'] FE23_northamerica_usa_wy025 ['KEEP'] FE23_southamerica_arge016 ['KEEP'] FE23_southamerica_arge085 ['REMOVE'] FE23_northamerica_canada_cana100 ['REMOVE'] FE23_northamerica_canada_cana213 ['KEEP'] FE23_northamerica_canada_cana105 ['REMOVE'] FE23_northamerica_canada_cana217 ['KEEP'] FE23_northamerica_canada_cana116 ['KEEP'] FE23_northamerica_canada_cana168w ['REMOVE'] FE23_northamerica_canada_cana161 ['KEEP'] FE23_northamerica_canada_cana162 ['REMOVE'] FE23_southamerica_chil016 ['KEEP'] FE23_southamerica_chil017 ['REMOVE'] 
FE23_europe_swed019w ['COMPOSITE'] FE23_europe_swed021w ['COMPOSITE'] FE23_northamerica_mexico_mexi022 ['REMOVE'] FE23_northamerica_mexico_mexi023 ['KEEP'] FE23_australia_newz003 ['REMOVE'] FE23_australia_newz060 ['KEEP'] FE23_australia_newz008 ['REMOVE'] FE23_australia_newz092 ['KEEP'] FE23_australia_newz014 ['REMOVE'] FE23_australia_newz061 ['KEEP'] FE23_australia_newz018 ['REMOVE'] FE23_australia_newz019 ['REMOVE'] FE23_australia_newz063 ['KEEP'] FE23_northamerica_usa_ca066 ['REMOVE'] FE23_northamerica_usa_ca628 ['KEEP', 'KEEP'] FE23_northamerica_usa_ca067 ['REMOVE'] FE23_northamerica_usa_ca512 ['REMOVE'] FE23_northamerica_usa_ca613 ['KEEP'] FE23_northamerica_usa_ca535 ['REMOVE'] FE23_northamerica_usa_ca670 ['KEEP'] FE23_northamerica_usa_me017 ['REMOVE'] FE23_northamerica_usa_me018 ['KEEP'] FE23_northamerica_usa_mo ['KEEP'] FE23_northamerica_usa_mo009 ['REMOVE'] FE23_northamerica_usa_nj001 ['KEEP'] FE23_northamerica_usa_nj002 ['REMOVE'] FE23_northamerica_usa_nm024 ['KEEP'] FE23_northamerica_usa_nm055 ['REMOVE'] FE23_northamerica_usa_nv060 ['REMOVE'] FE23_northamerica_usa_nv518 ['KEEP'] FE23_northamerica_usa_nv521 ['KEEP'] FE23_northamerica_usa_nv513 ['REMOVE'] FE23_northamerica_usa_nv520 ['KEEP'] ch2k_ZI15MER01_2 ['KEEP'] ch2k_ZI15MER01_4 ['REMOVE'] ch2k_CO03PAL03_6 ['REMOVE'] iso2k_511 ['KEEP'] ch2k_CO03PAL02_8 ['REMOVE'] iso2k_509 ['KEEP'] ch2k_CO03PAL07_14 ['REMOVE'] iso2k_521 ['KEEP'] ch2k_RE18CAY01_30 ['REMOVE'] iso2k_917 ['KEEP'] ch2k_KU99HOU01_40 ['REMOVE', 'REMOVE'] iso2k_786 ['KEEP', 'REMOVE'] iso2k_788 ['KEEP', 'KEEP'] ch2k_CA14TIM01_64 ['REMOVE'] iso2k_473 ['KEEP'] ch2k_SW98STP01_86 ['REMOVE'] iso2k_50 ['KEEP'] ch2k_CO03PAL01_110 ['REMOVE'] iso2k_507 ['KEEP'] ch2k_CH98PIR01_116 ['REMOVE'] iso2k_1229 ['KEEP'] ch2k_XI17HAI01_128 ['KEEP', 'REMOVE'] ch2k_XI17HAI01_136 ['REMOVE', 'REMOVE'] iso2k_1762 ['KEEP', 'KEEP'] ch2k_XI17HAI01_130 ['KEEP'] ch2k_XI17HAI01_134 ['REMOVE'] ch2k_DE14DTO03_140 ['REMOVE'] ch2k_DE14DTO01_148 ['REMOVE'] ch2k_EV18ROC01_184 
['KEEP'] ch2k_EV18ROC01_186 ['REMOVE'] ch2k_CA13SAP01_188 ['REMOVE'] iso2k_569 ['KEEP'] ch2k_HE13MIS01_194 ['REMOVE', 'REMOVE'] iso2k_211 ['REMOVE'] iso2k_213 ['REMOVE'] ch2k_ZI15IMP02_200 ['KEEP'] ch2k_ZI15IMP02_202 ['REMOVE'] ch2k_PF04PBA01_204 ['REMOVE', 'REMOVE'] iso2k_1701 ['KEEP', 'KEEP'] iso2k_1704 ['KEEP', 'REMOVE'] ch2k_CO03PAL05_212 ['REMOVE'] iso2k_515 ['KEEP'] ch2k_ZI15TAN01_278 ['KEEP'] ch2k_ZI15TAN01_280 ['REMOVE'] ch2k_GU99NAU01_314 ['KEEP', 'REMOVE'] iso2k_702 ['REMOVE', 'REMOVE'] iso2k_705 ['KEEP', 'KEEP'] ch2k_CO03PAL10_324 ['REMOVE'] iso2k_519 ['KEEP'] ch2k_ZI15IMP01_328 ['KEEP'] ch2k_ZI15IMP01_330 ['REMOVE'] ch2k_RO19YUC01_338 ['KEEP'] ch2k_RO19YUC01_340 ['REMOVE'] ch2k_CO03PAL09_358 ['REMOVE'] iso2k_525 ['KEEP'] ch2k_BA04FIJ02_382 ['REMOVE'] iso2k_52 ['KEEP'] ch2k_CO03PAL06_386 ['REMOVE'] iso2k_517 ['KEEP'] ch2k_CO93TAR01_408 ['REMOVE'] iso2k_539 ['KEEP'] ch2k_CO00MAL01_412 ['REMOVE'] ch2k_DE13HAI01_424 ['KEEP', 'REMOVE'] ch2k_DE13HAI01_432 ['REMOVE', 'REMOVE'] iso2k_1643 ['KEEP', 'KEEP'] ch2k_DE13HAI01_426 ['KEEP'] ch2k_DE13HAI01_430 ['REMOVE'] ch2k_ZI15CLE01_438 ['KEEP'] ch2k_ZI15CLE01_440 ['REMOVE'] ch2k_CO03PAL04_452 ['REMOVE'] iso2k_513 ['KEEP'] ch2k_FL18DTO01_460 ['REMOVE'] ch2k_FL18DTO02_554 ['REMOVE'] ch2k_DU94URV01_468 ['KEEP'] ch2k_DU94URV01_470 ['REMOVE'] ch2k_CO03PAL08_472 ['REMOVE'] iso2k_523 ['KEEP'] ch2k_LI99CLI01_486 ['REMOVE'] iso2k_1571 ['KEEP'] ch2k_ZI15BUN01_488 ['KEEP'] ch2k_ZI15BUN01_490 ['REMOVE'] ch2k_WU13TON01_504 ['KEEP'] ch2k_WU13TON01_506 ['REMOVE'] ch2k_KI14PAR01_510 ['KEEP'] ch2k_KI14PAR01_518 ['REMOVE'] ch2k_KI14PAR01_512 ['KEEP'] ch2k_KI14PAR01_516 ['REMOVE'] ch2k_BA04FIJ01_558 ['REMOVE'] iso2k_55 ['KEEP'] iso2k_120 ['KEEP'] sisal_253.0_171 ['REMOVE'] iso2k_140 ['KEEP'] sisal_278.0_184 ['REMOVE'] iso2k_236 ['KEEP'] sisal_205.0_141 ['REMOVE'] iso2k_380 ['KEEP'] sisal_446.0_292 ['REMOVE'] iso2k_399 ['KEEP', 'KEEP'] iso2k_806 ['REMOVE', 'KEEP'] iso2k_811 ['REMOVE', 'REMOVE'] iso2k_533 ['REMOVE'] sisal_115.0_69 
['KEEP'] iso2k_546 ['KEEP'] iso2k_549 ['REMOVE'] iso2k_547 ['KEEP'] iso2k_550 ['REMOVE'] iso2k_772 ['KEEP'] iso2k_775 ['REMOVE'] iso2k_873 ['KEEP'] sisal_471.0_314 ['REMOVE'] iso2k_1107 ['KEEP', 'KEEP'] iso2k_1817 ['REMOVE', 'KEEP'] sisal_271.0_174 ['REMOVE', 'REMOVE'] iso2k_1178 ['KEEP'] sisal_201.0_133 ['REMOVE'] iso2k_1283 ['KEEP'] iso2k_1286 ['REMOVE'] iso2k_1288 ['KEEP'] sisal_329.0_213 ['REMOVE'] iso2k_1291 ['KEEP'] sisal_330.0_215 ['REMOVE'] iso2k_1495 ['KEEP'] sisal_305.0_199 ['REMOVE'] iso2k_1504 ['REMOVE'] sisal_113.0_66 ['KEEP'] iso2k_1820 ['KEEP'] sisal_272.0_177 ['REMOVE'] iso2k_1823 ['KEEP'] sisal_273.0_179 ['REMOVE'] iso2k_1848 ['KEEP'] iso2k_1855 ['REMOVE'] iso2k_1850 ['KEEP'] iso2k_1851 ['REMOVE'] sisal_46.0_18 ['KEEP'] sisal_47.0_21 ['REMOVE'] sisal_46.0_19 ['KEEP'] sisal_47.0_22 ['REMOVE'] sisal_46.0_20 ['KEEP'] sisal_47.0_23 ['REMOVE'] sisal_430.0_270 ['KEEP'] sisal_896.0_531 ['REMOVE'] sisal_430.0_271 ['KEEP'] sisal_896.0_533 ['REMOVE']
Save all the duplicate details in one dictionary, which will be used in the duplicate-free dataframe (final output) df_dupfree to provide details on the duplicate detection process (duplicateDetails).
# Collect duplicate details for each record
dup_details = dup.collect_dup_details(df_decisions, header)
Note that any one record can appear more than once and have multiple decisions associated with it (e.g. 'REMOVE', 'KEEP' or 'COMPOSITE').

To remove the duplicates, we need to implement the following steps:
- Records to be REMOVED: remove all records from the dataframe associated with the decision 'REMOVE' and save the result in df_cleaned.
- Records to be COMPOSITED: create composites of these records and save them in df_composite.
- Check for records which have both 'REMOVE' and 'COMPOSITE' decisions associated with them. These are potentially remaining duplicates; the operator is once again asked to make decisions and run a 'mini' version of the duplicate workflow.
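The conflict check in the last step can be sketched as follows (toy per-record decision lists with hypothetical IDs):

```python
# Toy per-record decision lists; rec_c carries conflicting decisions
decisions = {
    'rec_a': ['REMOVE', 'REMOVE'],
    'rec_b': ['COMPOSITE', 'COMPOSITE'],
    'rec_c': ['REMOVE', 'COMPOSITE'],  # conflicting -> needs operator review
}

# Flag records associated with both 'REMOVE' and 'COMPOSITE'
conflicting = [rid for rid, dd in decisions.items()
               if 'REMOVE' in dd and 'COMPOSITE' in dd]
print(conflicting)  # -> ['rec_c']
```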
1. Records to be REMOVED¶
First, simply remove all the records to which the decision 'REMOVE' and/or 'COMPOSITE' applies and store the result in df_cleaned; the removed records are stored in df_duplica (for later inspection).
# load the records TO BE REMOVED OR COMPOSITED
remove_IDs = list(df_decisions['datasetId 1'][np.isin(df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
remove_IDs += list(df_decisions['datasetId 2'][np.isin(df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
remove_IDs = np.unique(remove_IDs)
df_duplica = df.loc[remove_IDs, 'datasetId'] # records flagged for removal or compositing
df_cleaned = df.drop(remove_IDs) # df freed from 'REMOVE'/'COMPOSITE' type records
print(f'Removed {len(df_duplica)} REMOVE or COMPOSITE type records.')
print(f'REMOVE type duplicate free dataset contains {len(df_cleaned)} records.')
print('Removed the following IDs:', remove_IDs)
print(df.name)
Removed 367 REMOVE or COMPOSITE type records. REMOVE type duplicate free dataset contains 4780 records. Removed the following IDs: ['FE23_australia_newz003' 'FE23_australia_newz008' 'FE23_australia_newz014' 'FE23_australia_newz018' 'FE23_australia_newz019' 'FE23_europe_swed019w' 'FE23_europe_swed021w' 'FE23_northamerica_canada_cana100' 'FE23_northamerica_canada_cana105' 'FE23_northamerica_canada_cana162' 'FE23_northamerica_canada_cana168w' 'FE23_northamerica_mexico_mexi022' 'FE23_northamerica_usa_ca066' 'FE23_northamerica_usa_ca067' 'FE23_northamerica_usa_ca512' 'FE23_northamerica_usa_ca535' 'FE23_northamerica_usa_me017' 'FE23_northamerica_usa_mo009' 'FE23_northamerica_usa_mt113' 'FE23_northamerica_usa_nj002' 'FE23_northamerica_usa_nm055' 'FE23_northamerica_usa_nv060' 'FE23_northamerica_usa_nv512' 'FE23_northamerica_usa_nv513' 'FE23_southamerica_arge085' 'FE23_southamerica_chil017' 'ch2k_AS05GUA01_302' 'ch2k_BA04FIJ01_558' 'ch2k_BA04FIJ02_382' 'ch2k_CA07FLI01_400' 'ch2k_CA13SAP01_188' 'ch2k_CA14TIM01_64' 'ch2k_CH98PIR01_116' 'ch2k_CO00MAL01_412' 'ch2k_CO03PAL01_110' 'ch2k_CO03PAL02_8' 'ch2k_CO03PAL03_6' 'ch2k_CO03PAL04_452' 'ch2k_CO03PAL05_212' 'ch2k_CO03PAL06_386' 'ch2k_CO03PAL07_14' 'ch2k_CO03PAL08_472' 'ch2k_CO03PAL09_358' 'ch2k_CO03PAL10_324' 'ch2k_CO93TAR01_408' 'ch2k_DA06MAF01_78' 'ch2k_DA06MAF02_104' 'ch2k_DE13HAI01_424' 'ch2k_DE13HAI01_430' 'ch2k_DE13HAI01_432' 'ch2k_DE14DTO01_148' 'ch2k_DE14DTO03_140' 'ch2k_DR99ABR01_264' 'ch2k_DR99ABR01_266' 'ch2k_DU94URV01_470' 'ch2k_EV18ROC01_186' 'ch2k_FE09OGA01_304' 'ch2k_FE18RUS01_492' 'ch2k_FL18DTO01_460' 'ch2k_FL18DTO02_554' 'ch2k_GO12SBV01_396' 'ch2k_GU99NAU01_314' 'ch2k_HE08LRA01_76' 'ch2k_HE10GUA01_244' 'ch2k_HE13MIS01_194' 'ch2k_KI04MCV01_366' 'ch2k_KI14PAR01_516' 'ch2k_KI14PAR01_518' 'ch2k_KU00NIN01_150' 'ch2k_KU99HOU01_40' 'ch2k_LI06FIJ01_582' 'ch2k_LI06RAR01_12' 'ch2k_LI06RAR02_270' 'ch2k_LI94SEC01_436' 'ch2k_LI99CLI01_486' 'ch2k_MO06PED01_226' 'ch2k_NA09MAL01_84' 'ch2k_NU11PAL01_52' 'ch2k_OS14UCP01_236' 
'ch2k_PF04PBA01_204' 'ch2k_QU06RAB01_144' 'ch2k_QU96ESV01_422' 'ch2k_RE18CAY01_30' 'ch2k_RO19YUC01_340' 'ch2k_SW98STP01_86' 'ch2k_TU01DEP01_450' 'ch2k_TU95MAD01_24' 'ch2k_UR00MAI01_22' 'ch2k_WU13TON01_506' 'ch2k_XI17HAI01_128' 'ch2k_XI17HAI01_134' 'ch2k_XI17HAI01_136' 'ch2k_ZI04IFR01_26' 'ch2k_ZI14IFR02_524' 'ch2k_ZI14TUR01_480' 'ch2k_ZI14TUR01_482' 'ch2k_ZI15BUN01_490' 'ch2k_ZI15CLE01_440' 'ch2k_ZI15IMP01_330' 'ch2k_ZI15IMP02_202' 'ch2k_ZI15MER01_4' 'ch2k_ZI15TAN01_280' 'iso2k_1068' 'iso2k_1069' 'iso2k_1286' 'iso2k_1504' 'iso2k_1554' 'iso2k_1704' 'iso2k_1817' 'iso2k_1851' 'iso2k_1855' 'iso2k_211' 'iso2k_213' 'iso2k_298' 'iso2k_299' 'iso2k_505' 'iso2k_533' 'iso2k_549' 'iso2k_550' 'iso2k_579' 'iso2k_58' 'iso2k_702' 'iso2k_775' 'iso2k_786' 'iso2k_806' 'iso2k_811' 'iso2k_98' 'pages2k_0' 'pages2k_1004' 'pages2k_1026' 'pages2k_1048' 'pages2k_107' 'pages2k_1089' 'pages2k_1108' 'pages2k_1116' 'pages2k_1147' 'pages2k_1156' 'pages2k_1160' 'pages2k_1209' 'pages2k_122' 'pages2k_1252' 'pages2k_1274' 'pages2k_1293' 'pages2k_132' 'pages2k_1325' 'pages2k_1360' 'pages2k_1365' 'pages2k_1370' 'pages2k_1420' 'pages2k_1444' 'pages2k_1488' 'pages2k_1490' 'pages2k_1491' 'pages2k_1497' 'pages2k_1519' 'pages2k_1520' 'pages2k_1547' 'pages2k_1566' 'pages2k_158' 'pages2k_1605' 'pages2k_1623' 'pages2k_1636' 'pages2k_1688' 'pages2k_1692' 'pages2k_1703' 'pages2k_171' 'pages2k_1712' 'pages2k_1720' 'pages2k_1741' 'pages2k_1750' 'pages2k_1771' 'pages2k_1804' 'pages2k_1859' 'pages2k_1861' 'pages2k_1880' 'pages2k_1891' 'pages2k_1918' 'pages2k_1923' 'pages2k_1932' 'pages2k_1942' 'pages2k_1973' 'pages2k_1980' 'pages2k_1983' 'pages2k_1985' 'pages2k_1991' 'pages2k_1994' 'pages2k_2013' 'pages2k_203' 'pages2k_2042' 'pages2k_2059' 'pages2k_2085' 'pages2k_2094' 'pages2k_2103' 'pages2k_2110' 'pages2k_2149' 'pages2k_2150' 'pages2k_2156' 'pages2k_2214' 'pages2k_2220' 'pages2k_2226' 'pages2k_225' 'pages2k_2265' 'pages2k_2290' 'pages2k_2300' 'pages2k_2303' 'pages2k_2309' 'pages2k_2311' 'pages2k_2319' 
'pages2k_2344' 'pages2k_2361' 'pages2k_238' 'pages2k_2402' 'pages2k_242' 'pages2k_2430' 'pages2k_2473' 'pages2k_2502' 'pages2k_2510' 'pages2k_2514' 'pages2k_2517' 'pages2k_2534' 'pages2k_2538' 'pages2k_2561' 'pages2k_258' 'pages2k_2595' 'pages2k_2596' 'pages2k_2599' 'pages2k_2604' 'pages2k_2606' 'pages2k_2609' 'pages2k_2612' 'pages2k_2613' 'pages2k_2617' 'pages2k_263' 'pages2k_2634' 'pages2k_2660' 'pages2k_267' 'pages2k_2677' 'pages2k_2703' 'pages2k_271' 'pages2k_2722' 'pages2k_273' 'pages2k_2750' 'pages2k_2755' 'pages2k_2759' 'pages2k_2795' 'pages2k_2798' 'pages2k_281' 'pages2k_2830' 'pages2k_2843' 'pages2k_2901' 'pages2k_2906' 'pages2k_2922' 'pages2k_294' 'pages2k_2953' 'pages2k_2959' 'pages2k_2976' 'pages2k_3002' 'pages2k_3030' 'pages2k_3033' 'pages2k_3038' 'pages2k_305' 'pages2k_3064' 'pages2k_3068' 'pages2k_307' 'pages2k_3085' 'pages2k_3107' 'pages2k_3108' 'pages2k_3132' 'pages2k_3134' 'pages2k_315' 'pages2k_317' 'pages2k_3170' 'pages2k_3179' 'pages2k_3191' 'pages2k_3196' 'pages2k_3202' 'pages2k_323' 'pages2k_3236' 'pages2k_3239' 'pages2k_3243' 'pages2k_3263' 'pages2k_3266' 'pages2k_3307' 'pages2k_3313' 'pages2k_3342' 'pages2k_3352' 'pages2k_3372' 'pages2k_3374' 'pages2k_3404' 'pages2k_3419' 'pages2k_3503' 'pages2k_3524' 'pages2k_3550' 'pages2k_3552' 'pages2k_3554' 'pages2k_3571' 'pages2k_3583' 'pages2k_3599' 'pages2k_3609' 'pages2k_3631' 'pages2k_3642' 'pages2k_385' 'pages2k_387' 'pages2k_395' 'pages2k_397' 'pages2k_409' 'pages2k_418' 'pages2k_421' 'pages2k_433' 'pages2k_445' 'pages2k_446' 'pages2k_462' 'pages2k_468' 'pages2k_474' 'pages2k_477' 'pages2k_478' 'pages2k_486' 'pages2k_495' 'pages2k_50' 'pages2k_500' 'pages2k_541' 'pages2k_565' 'pages2k_583' 'pages2k_592' 'pages2k_6' 'pages2k_610' 'pages2k_626' 'pages2k_63' 'pages2k_691' 'pages2k_730' 'pages2k_736' 'pages2k_800' 'pages2k_81' 'pages2k_818' 'pages2k_83' 'pages2k_830' 'pages2k_831' 'pages2k_842' 'pages2k_857' 'pages2k_88' 'pages2k_881' 'pages2k_895' 'pages2k_900' 'pages2k_94' 'pages2k_940' 
'pages2k_945' 'pages2k_960' 'pages2k_976' 'pages2k_982' 'sisal_201.0_133' 'sisal_205.0_141' 'sisal_253.0_171' 'sisal_271.0_174' 'sisal_272.0_177' 'sisal_273.0_179' 'sisal_278.0_184' 'sisal_294.0_194' 'sisal_305.0_199' 'sisal_329.0_213' 'sisal_330.0_215' 'sisal_446.0_292' 'sisal_47.0_21' 'sisal_47.0_22' 'sisal_47.0_23' 'sisal_471.0_314' 'sisal_896.0_531' 'sisal_896.0_533'] all_merged
# add columns on decision process to df_cleaned:
df_cleaned['duplicateDetails']='N/A'
for ID in dup_details:
if ID in df_cleaned.index:
if df_cleaned.at[ID, 'duplicateDetails']=='N/A':
df_cleaned.at[ID, 'duplicateDetails']=dup_details[ID]
else: df_cleaned.at[ID, 'duplicateDetails']+=dup_details[ID]
# df_cleaned[df_cleaned[ 'duplicateDetails']!='N/A'].at['ch2k_DE14DTO03_140', 'duplicateDetails']
2. Records to be COMPOSITED¶
Now identify all the records to which the decision 'COMPOSITE' applies, create composites and store them in df_composite.
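The changelog notes that composites are built by creating z-scores and averaging over the shared time period. A minimal sketch of that idea, for illustration only: the notebook itself delegates to dup.join_composites_metadata, and the function name composite_zscore below is invented here.

```python
import numpy as np

def composite_zscore(year1, vals1, year2, vals2):
    """Composite two records: z-score each series over its full length,
    then average wherever one or both series has a value.
    Illustrative sketch only, not the project's implementation."""
    z = lambda v: (np.asarray(v, float) - np.nanmean(v)) / np.nanstd(v)
    s1 = dict(zip(year1, z(vals1)))
    s2 = dict(zip(year2, z(vals2)))
    # union of the two time axes; average over whichever series covers each year
    years = sorted(set(year1) | set(year2))
    comp = [np.mean([d[y] for d in (s1, s2) if y in d]) for y in years]
    return np.array(years), np.array(comp)
```

In the overlap the composite is the mean of the two z-scored series; outside it, the composite simply carries the single available (z-scored) series.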
# add the column 'duplicateDetails' to df, in case it does not exist
if 'duplicateDetails' not in df.columns: df['duplicateDetails']='N/A'
# load the records to be composited
comp_ID_pairs = df_decisions[(df_decisions['Decision 1']=='COMPOSITE')&(df_decisions['Decision 2']=='COMPOSITE')]
# create new composite data and metadata from the pairs
# loop through the composite pairs and check metadata
df_composite = dup.join_composites_metadata(df, comp_ID_pairs, df_decisions, header)
FE23_europe_swed019w FE23_europe_swed021w -------------------------------------------------------------------------------- Metadata different for >>>geo_siteName<<< in: FE23_europe_swed019w (Torneträskr+f.,Bartoli) and FE23_europe_swed021w (Torneträskfos.,Bartoli). -------------------------------------------------------------------------------- Metadata different for >>>interpretation_variable<<< in: FE23_europe_swed019w (NOT temperature NOT moisture) and FE23_europe_swed021w (N/A).
saved figure in /home/jupyter-lluecke/dod2k/figs//all_merged/dup_detection//composite_FE23_europe_swed019w_FE23_europe_swed021w.pdf
print(df_composite.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1 entries, 0 to 0 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 1 non-null object 1 geo_meanElev 1 non-null float32 2 geo_meanLat 1 non-null float32 3 geo_meanLon 1 non-null float32 4 geo_siteName 1 non-null object 5 paleoData_proxy 1 non-null object 6 yearUnits 1 non-null object 7 interpretation_variable 1 non-null object 8 interpretation_direction 1 non-null object 9 interpretation_seasonality 1 non-null object 10 paleoData_values 1 non-null object 11 year 1 non-null object 12 dataSetName 1 non-null object 13 originalDatabase 1 non-null object 14 originalDataURL 1 non-null object 15 paleoData_notes 1 non-null object 16 interpretation_variableDetail 1 non-null object 17 datasetId 1 non-null object 18 paleoData_units 1 non-null object 19 duplicateDetails 1 non-null object dtypes: float32(3), object(17) memory usage: 280.0+ bytes None
3. Check for overlap between duplicates¶
The duplicate free dataframe is obtained by joining
- df_cleaned (duplicate free, as all records with decision REMOVE and/or COMPOSITE have been removed) and
- df_composite (duplicate free, as duplicates are composited).
However, there might still be duplicates between the two dataframes when a record has been associated with more than one duplicate candidate pair.
The scenarios for duplicates appearing twice:
1. REMOVE/KEEP and COMPOSITE:

- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> COMPOSITE, c -> COMPOSITE.

In this case, b and ac (the composite record of a and c) would be duplicates in the merged dataframe.

2a. REMOVE/KEEP & KEEP/REMOVE:

- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> KEEP, c -> REMOVE.

In this case a would still be removed, as REMOVE overrides KEEP in the algorithm. So only b is kept and no duplicates remain.

2b. REMOVE/KEEP & REMOVE/KEEP:

- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> REMOVE, c -> KEEP.

In this case a would be removed, but b and c are kept and would be duplicates in the merged dataframe.

3. COMPOSITE x 2:

- duplicate pair a and b have had the decisions assigned: a -> COMPOSITE, b -> COMPOSITE
- duplicate pair a and c have had the decisions assigned: a -> COMPOSITE, c -> COMPOSITE.

In this case, ab and ac would be duplicates in the merged dataframe.

4. REMOVE/KEEP & KEEP/KEEP:

- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> KEEP, c -> KEEP.

In this case a would be removed and b and c are kept, but as c is not a duplicate of a, no duplicates remain.
Therefore, we loop through the records in the joined dataframe which have been associated with multiple duplicates.
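To make the bookkeeping concrete, here is a toy illustration (hypothetical IDs a, b, c; the notebook builds the same structure from the decisions CSV) of how a record that appears in more than one pair gets flagged:

```python
# each tuple: (id1, decision1, id2, decision2) for one duplicate pair
pairs = [('a', 'REMOVE', 'b', 'KEEP'),
         ('a', 'COMPOSITE', 'c', 'COMPOSITE')]

# collect all decisions per record ID across pairs
decisions = {}
for id1, dec1, id2, dec2 in pairs:
    for rec_id, dec in ((id1, dec1), (id2, dec2)):
        decisions.setdefault(rec_id, []).append(dec)

# records with more than one decision need the iterative re-check below
multiple_dups = [rec_id for rec_id, decs in decisions.items() if len(decs) > 1]
# 'a' was decided twice (REMOVE and COMPOSITE), so it is flagged
```

Record a carries both a REMOVE and a COMPOSITE decision, which is exactly the situation the loop below resolves by re-running duplicate detection on the flagged records.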
Merge the composites and the dataframe freed from REMOVE and COMPOSITE type records
# initialise the loop: start from the cleaned records plus the new composites
tmp_df_dupfree = pd.concat([df_cleaned, df_composite])
tmp_df_dupfree.index = tmp_df_dupfree['datasetId']
tmp_decisions = decisions.copy()
Create a loop which implements a mini duplicate detection on all the records which have multiple decisions associated with them.
# Simple composite tracking for debugging only
composite_log = []
for ii in range(10):
tmp_df_dupfree.set_index('datasetId', inplace = True)
tmp_df_dupfree['datasetId']=tmp_df_dupfree.index
print('-'*20)
print(f'ITERATION # {ii}')
multiple_dups = []
for id in tmp_decisions.keys():
if len(tmp_decisions[id]) > 1:
if id not in multiple_dups:
multiple_dups.append(id)
if len(multiple_dups) > 0:
# Check which of the multiple duplicate IDs are still in the dataframe
multiple_dups_new = []
current_ids = set(tmp_df_dupfree.index) # Get all current IDs as a set
for id in multiple_dups:
if id in current_ids: # Simple membership check
multiple_dups_new.append(id)
if len(multiple_dups_new) > 0:
print(f'WARNING! Decisions associated with {len(multiple_dups_new)} multiple duplicates in the new dataframe.')
print('Please review these records below and run through a further duplicate detection workflow until no more duplicates are found.')
else:
print('No more multiple duplicates found in current dataframe.')
print('SUCCESS!!')
break
else:
print('No more multiple duplicates.')
print('SUCCESS!!')
break
# Now we create a small dataframe which needs to be checked for duplicates.
df_check = tmp_df_dupfree.copy()[np.isin(tmp_df_dupfree['datasetId'], multiple_dups_new)]
print('Check dataframe: ')
df_check.name = 'tmp'
df_check.index = range(len(df_check))
print(df_check.info())
# We then run a brief duplicate detection algorithm on the dataframe. Note that by default the composited data has the highest value in the hierarchy.
pot_dup_IDs = dup.find_duplicates_optimized(df_check, n_points_thresh=10, return_data=True)
if len(pot_dup_IDs)==0:
print('SUCCESS!! NO MORE DUPLICATES DETECTED!!')
break
else:
yn=''
while yn not in ['y', 'n']:
yn = input('Do you want to continue with the decision process for duplicates? [y/n]')
if yn=='n': break
df_check = dup.define_hierarchy(df_check)
dup.duplicate_decisions_multiple(df_check, operator_details=operator_details, choose_recollection=True,
remove_identicals=False, backup=False, comment=False)
# implement the decisions
tmp_df_decisions = pd.read_csv(f'data/{df_check.name}/dup_detection/dup_decisions_{df_check.name}_{initials}_{date}'+'.csv', header=5)
tmp_dup_details = dup.provide_dup_details(tmp_df_decisions, header)
# decisions
tmp_decisions = {}
for ind in tmp_df_decisions.index:
id1, id2 = tmp_df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
dec1, dec2 = tmp_df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
for id, dec in zip([id1, id2], [dec1, dec2]):
if id not in tmp_decisions: tmp_decisions[id] = []
tmp_decisions[id]+=[dec]
df_check.set_index('datasetId', inplace = True)
df_check['datasetId']=df_check.index
#drop all REMOVE or COMPOSITE types
tmp_remove_IDs = list(tmp_df_decisions['datasetId 1'][np.isin(tmp_df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
tmp_remove_IDs += list(tmp_df_decisions['datasetId 2'][np.isin(tmp_df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
tmp_remove_IDs = np.unique(tmp_remove_IDs)
tmp_df_cleaned = tmp_df_dupfree.drop(tmp_remove_IDs) # df freed from 'REMOVE' type duplicates
# composite the records flagged COMPOSITE
tmp_comp_ID_pairs = tmp_df_decisions[(tmp_df_decisions['Decision 1']=='COMPOSITE')&(tmp_df_decisions['Decision 2']=='COMPOSITE')]
if len(tmp_comp_ID_pairs) > 0:
for _, pair in tmp_comp_ID_pairs.iterrows():
id1, id2 = pair['datasetId 1'], pair['datasetId 2']
# Log what was composited
composite_log.append({
'iteration': ii,
'composited': [id1, id2],
'new_id': f"{id1}_{id2}_composite" # placeholder; the actual composite ID is generated during compositing
})
# create new composite data and metadata from the pairs
# loop through the composite pairs and check metadata
tmp_df_composite = dup.join_composites_metadata(df_check, tmp_comp_ID_pairs, tmp_df_decisions, header)
tmp_df_dupfree = pd.concat([tmp_df_cleaned, tmp_df_composite])
print('--'*20)
print('Finished iteration.')
print('NEW DATAFRAME:')
print(tmp_df_dupfree.info())
print('--'*20)
print('--'*20)
if ii==9: print('STILL DUPLICATES PRESENT AFTER MULTIPLE ITERATIONS! REVISE DECISION PROCESS!!')
print('--'*20)
print(f"Created {len(composite_log)} composites across all iterations")
-------------------- ITERATION # 0 WARNING! Decisions associated with 54 multiple duplicates in the new dataframe. Please review these records below and run through a further duplicate detection workflow until no more duplicates are found. Check dataframe: <class 'pandas.core.frame.DataFrame'> RangeIndex: 54 entries, 0 to 53 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 54 non-null object 1 dataSetName 54 non-null object 2 geo_meanElev 53 non-null float32 3 geo_meanLat 54 non-null float32 4 geo_meanLon 54 non-null float32 5 geo_siteName 54 non-null object 6 interpretation_direction 54 non-null object 7 interpretation_seasonality 54 non-null object 8 interpretation_variable 54 non-null object 9 interpretation_variableDetail 54 non-null object 10 originalDataURL 54 non-null object 11 originalDatabase 54 non-null object 12 paleoData_notes 54 non-null object 13 paleoData_proxy 54 non-null object 14 paleoData_sensorSpecies 54 non-null object 15 paleoData_units 54 non-null object 16 paleoData_values 54 non-null object 17 paleoData_variableName 54 non-null object 18 year 54 non-null object 19 yearUnits 54 non-null object 20 duplicateDetails 54 non-null object 21 datasetId 54 non-null object dtypes: float32(3), object(19) memory usage: 8.8+ KB None tmp Start duplicate search: ================================= checking parameters: proxy archive : must match proxy type : must match distance (km) < 8 elevation : must match time overlap > 10 correlation > 0.9 RMSE < 0.1 1st difference rmse < 0.1 correlation of 1st difference > 0.9 ================================= Start duplicate search Progress: 0/54 Progress: 10/54 Progress: 20/54 Progress: 30/54 Progress: 40/54 Progress: 50/54 ============================================================ Saved indices, IDs, distances, correlations in data/tmp/dup_detection/ ============================================================ Detected 0 possible duplicates in tmp. 
============================================================ ============================================================ Indices: IDs: ============================================================ SUCCESS!! NO MORE DUPLICATES DETECTED!! Created 0 composites across all iterations
4. Create duplicate free dataframe¶
Do another duplicate search on the whole dataframe to make sure there are no duplicates present anymore.
tmp_df_dupfree.set_index('datasetId', inplace = True)
tmp_df_dupfree['datasetId']=tmp_df_dupfree.index
# Now we create a dataframe which needs to be checked for duplicates.
df_check = tmp_df_dupfree.copy()
df_check.name = 'tmp'
df_check.index = range(len(df_check))
# We then run a brief duplicate detection algorithm on the dataframe. Note that by default the composited data has the highest value in the hierarchy.
pot_dup_IDs = dup.find_duplicates_optimized(df_check, n_points_thresh=10, return_data=True)
if len(pot_dup_IDs)==0:
print('SUCCESS!! NO MORE DUPLICATES DETECTED!!')
else:
df_check = dup.define_hierarchy(df_check)
dup.duplicate_decisions_multiple(df_check, operator_details=operator_details, choose_recollection=True,
remove_identicals=False, backup=False)
# implement the decisions
tmp_df_decisions = pd.read_csv(f'data/{df_check.name}/dup_detection/dup_decisions_{df_check.name}_{initials}_{date}'+'.csv', header=5)
tmp_dup_details = dup.provide_dup_details(tmp_df_decisions, header)
# decisions
tmp_decisions = {}
for ind in tmp_df_decisions.index:
id1, id2 = tmp_df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
dec1, dec2 = tmp_df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
for id, dec in zip([id1, id2], [dec1, dec2]):
if id not in tmp_decisions: tmp_decisions[id] = []
tmp_decisions[id]+=[dec]
df_check.set_index('datasetId', inplace = True)
df_check['datasetId']=df_check.index
#drop all REMOVE or COMPOSITE types
tmp_remove_IDs = list(tmp_df_decisions['datasetId 1'][np.isin(tmp_df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
tmp_remove_IDs += list(tmp_df_decisions['datasetId 2'][np.isin(tmp_df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
tmp_remove_IDs = np.unique(tmp_remove_IDs)
tmp_df_cleaned = tmp_df_dupfree.drop(tmp_remove_IDs) # df freed from 'REMOVE' type duplicates
# composite the records flagged COMPOSITE
tmp_comp_ID_pairs = tmp_df_decisions[(tmp_df_decisions['Decision 1']=='COMPOSITE')&(tmp_df_decisions['Decision 2']=='COMPOSITE')]
# create new composite data and metadata from the pairs
# loop through the composite pairs and check metadata
tmp_df_composite = dup.join_composites_metadata(df_check, tmp_comp_ID_pairs, tmp_df_decisions, header)
tmp_df_dupfree = pd.concat([tmp_df_cleaned, tmp_df_composite])
print('Finished last round of duplicate removal.')
print('Potentially run through this cell again to check for remaining duplicates.')
tmp Start duplicate search: ================================= checking parameters: proxy archive : must match proxy type : must match distance (km) < 8 elevation : must match time overlap > 10 correlation > 0.9 RMSE < 0.1 1st difference rmse < 0.1 correlation of 1st difference > 0.9 ================================= Start duplicate search Progress: 0/4781 ... Progress: 3930/4781
/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide c /= stddev[:, None] /home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide c /= stddev[None, :]
Progress: 3940/4781 ... Progress: 4780/4781 ============================================================ Saved indices, IDs, distances, correlations in data/tmp/dup_detection/ ============================================================ Detected 0 possible duplicates in tmp. 
============================================================ ============================================================ Indices: IDs: ============================================================ SUCCESS!! NO MORE DUPLICATES DETECTED!!
df_dupfree = tmp_df_dupfree
print(df_dupfree.info())
<class 'pandas.core.frame.DataFrame'> Index: 4781 entries, pages2k_5 to dod2k_composite_z_FE23_europe_swed019w_FE23_europe_swed021w Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 4781 non-null object 1 dataSetName 4781 non-null object 2 geo_meanElev 4699 non-null float32 3 geo_meanLat 4781 non-null float32 4 geo_meanLon 4781 non-null float32 5 geo_siteName 4781 non-null object 6 interpretation_direction 4781 non-null object 7 interpretation_seasonality 4781 non-null object 8 interpretation_variable 4781 non-null object 9 interpretation_variableDetail 4781 non-null object 10 originalDataURL 4781 non-null object 11 originalDatabase 4781 non-null object 12 paleoData_notes 4781 non-null object 13 paleoData_proxy 4781 non-null object 14 paleoData_sensorSpecies 4780 non-null object 15 paleoData_units 4781 non-null object 16 paleoData_values 4781 non-null object 17 paleoData_variableName 4780 non-null object 18 year 4781 non-null object 19 yearUnits 4781 non-null object 20 duplicateDetails 4781 non-null object 21 datasetId 4781 non-null object dtypes: float32(3), object(19) memory usage: 803.1+ KB None
Save duplicate free dataframe¶
Sort the columns and assign a name to the dataframe, which is used for saving purposes (it determines the directory and filenames). Make sure that the date and operator initials are used in the name.
df_dupfree = df_dupfree[sorted(df_dupfree.columns)]
if df.name=='all_merged':
yn = input('Would you like to save as dod2k_v2.0? [y/n]')
if yn=='y':
df_dupfree.name = 'dod2k_v2.0'
else:
df_dupfree.name =f'{df.name}_{initials}_{date}_dupfree'
os.makedirs(f'data/{df_dupfree.name}/', exist_ok=True)
df_dupfree.info()
print(df_dupfree.name)
<class 'pandas.core.frame.DataFrame'> Index: 4781 entries, pages2k_5 to dod2k_composite_z_FE23_europe_swed019w_FE23_europe_swed021w Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 4781 non-null object 1 dataSetName 4781 non-null object 2 datasetId 4781 non-null object 3 duplicateDetails 4781 non-null object 4 geo_meanElev 4699 non-null float32 5 geo_meanLat 4781 non-null float32 6 geo_meanLon 4781 non-null float32 7 geo_siteName 4781 non-null object 8 interpretation_direction 4781 non-null object 9 interpretation_seasonality 4781 non-null object 10 interpretation_variable 4781 non-null object 11 interpretation_variableDetail 4781 non-null object 12 originalDataURL 4781 non-null object 13 originalDatabase 4781 non-null object 14 paleoData_notes 4781 non-null object 15 paleoData_proxy 4781 non-null object 16 paleoData_sensorSpecies 4780 non-null object 17 paleoData_units 4781 non-null object 18 paleoData_values 4781 non-null object 19 paleoData_variableName 4780 non-null object 20 year 4781 non-null object 21 yearUnits 4781 non-null object dtypes: float32(3), object(19) memory usage: 803.1+ KB dod2k_v2.0
save pickle¶
# save the duplicate free dataframe as a pickle
df_dupfree.to_pickle(f'data/{df_dupfree.name}/{df_dupfree.name}_compact.pkl')
save csv¶
# save to a list of csv files (metadata, data, year)
utf.write_compact_dataframe_to_csv(df_dupfree)
METADATA: datasetId, archiveType, dataSetName, duplicateDetails, geo_meanElev, geo_meanLat, geo_meanLon, geo_siteName, interpretation_direction, interpretation_seasonality, interpretation_variable, interpretation_variableDetail, originalDataURL, originalDatabase, paleoData_notes, paleoData_proxy, paleoData_sensorSpecies, paleoData_units, paleoData_variableName, yearUnits Saved to /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact_%s.csv
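The call above splits the compact dataframe into metadata, year, and paleoData_values CSV files. A rough sketch of what such a split might look like, assuming the ragged series are flattened to comma-separated strings keyed by datasetId (this is an assumption for illustration, not the actual implementation of utf.write_compact_dataframe_to_csv):

```python
import os
import pandas as pd

def write_compact_csvs(df, name, outdir='data'):
    """Sketch of a compact CSV export: metadata in one file, the ragged
    year/value series in two others, one row per record."""
    base = os.path.join(outdir, name, f'{name}_compact')
    # metadata: everything except the ragged series columns
    meta_cols = [c for c in df.columns if c not in ('year', 'paleoData_values')]
    df[meta_cols].to_csv(base + '_metadata.csv', index=False)
    # series columns: flatten each list to a comma-separated string
    for col in ('year', 'paleoData_values'):
        pd.DataFrame({
            'datasetId': df['datasetId'],
            col: [','.join(map(str, v)) for v in df[col]],
        }).to_csv(f'{base}_{col}.csv', index=False)
```

Keeping the series in separate files avoids a very wide or very sparse single CSV, since the records have different lengths and time axes.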
# load dataframe
print(utf.load_compact_dataframe_from_csv(df_dupfree.name).info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4781 entries, 0 to 4780 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 4781 non-null object 1 dataSetName 4781 non-null object 2 datasetId 4781 non-null object 3 duplicateDetails 4781 non-null object 4 geo_meanElev 4699 non-null float32 5 geo_meanLat 4781 non-null float32 6 geo_meanLon 4781 non-null float32 7 geo_siteName 4781 non-null object 8 interpretation_direction 4781 non-null object 9 interpretation_seasonality 4781 non-null object 10 interpretation_variable 4781 non-null object 11 interpretation_variableDetail 4781 non-null object 12 originalDataURL 4781 non-null object 13 originalDatabase 4781 non-null object 14 paleoData_notes 4781 non-null object 15 paleoData_proxy 4781 non-null object 16 paleoData_sensorSpecies 4781 non-null object 17 paleoData_units 4781 non-null object 18 paleoData_values 4781 non-null object 19 paleoData_variableName 4781 non-null object 20 year 4781 non-null object 21 yearUnits 4781 non-null object dtypes: float32(3), object(19) memory usage: 765.8+ KB None
# write header with operator information as README txt file
with open(f'data/{df_dupfree.name}/{df_dupfree.name}_dupfree_README.txt', 'w') as file:
    for line in header:
        file.write(line+'\n')
fn = utf.find(df_dupfree.name, f'data/{df_dupfree.name}')
print(fn)
if fn != []:
    print('----------------------------------------------------')
    print('Successfully finished the duplicate finalising process!'.upper())
    print('----------------------------------------------------')
    print('Saved the final output files in:')
    print()
    for ff in fn:
        print(' '+os.getcwd()+'/%s.'%ff)
    print()
    print('The duplicate detection process is now finished and the duplicate-free database is ready to use!')
else:
    print('!!!!!!!!!!!!WARNING!!!!!!!!!!!')
    print(f'Final output file is missing at data/{df_dupfree.name}.')
    print()
    print('Please re-run the notebook to complete the duplicate finalising process.')
['data/dod2k_v2.0/dod2k_v2.0_compact.pkl', 'data/dod2k_v2.0/dod2k_v2.0_compact_metadata.csv', 'data/dod2k_v2.0/dod2k_v2.0_compact_year.csv', 'data/dod2k_v2.0/dod2k_v2.0_compact_paleoData_values.csv', 'data/dod2k_v2.0/dod2k_v2.0_README.txt', 'data/dod2k_v2.0/dod2k_v2.0_dupfree_README.txt']
----------------------------------------------------
SUCCESSFULLY FINISHED THE DUPLICATE FINALISING PROCESS!
----------------------------------------------------
Saved the final output files in:

 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact.pkl.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact_metadata.csv.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact_year.csv.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_compact_paleoData_values.csv.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_README.txt.
 /home/jupyter-lluecke/dod2k/data/dod2k_v2.0/dod2k_v2.0_dupfree_README.txt.

The duplicate detection process is now finished and the duplicate-free database is ready to use!
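The sanity check above relies on `utf.find` returning a list of file paths matching the database name. Its assumed behaviour can be sketched with `os.walk` (a hypothetical re-implementation for illustration, not the actual `ut_functions` helper):

```python
import os
import tempfile

def find_files(pattern: str, path: str) -> list:
    """Return paths of all files below `path` whose names contain `pattern`.
    A minimal stand-in for the assumed behaviour of `utf.find`."""
    hits = []
    for root, _dirs, files in os.walk(path):
        hits += [os.path.join(root, name) for name in files if pattern in name]
    return sorted(hits)

# usage on a throwaway directory
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, 'dod2k_demo_compact.pkl'), 'w').close()
open(os.path.join(tmp, 'unrelated.txt'), 'w').close()
print(find_files('dod2k_demo', tmp))
```

An empty list from such a helper is what triggers the warning branch in the cell above.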
Summary and summary plots of datasets¶
Import plotting libraries
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec as GS
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from dod2k_utilities import ut_plot as uplt # contains plotting functions
#%% print some info about the data
db_types = df_cleaned['originalDatabase'].unique()
col = uplt.get_colours(range(len(db_types)), 'tab10', 0, len(db_types))
#col = ['tab:blue','tab:green', 'tab:grey', 'tab:pink', 'tab:orange']
counts = []
ticks = []
colours = []
for ii, db in enumerate(db_types):
    cc = df_cleaned['originalDatabase'][(df_cleaned['originalDatabase']==db)].count()
    counts += [cc]
    ticks += [db.split('(Ocn_103')[0]]
    colours += [col[ii]]
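The per-database tallies computed in the loop above can equivalently be obtained with pandas' `value_counts` in a single call (shown here on a toy frame, since `df_cleaned` is built earlier in the notebook; note that `value_counts` sorts by count by default, whereas the loop keeps the order of `unique()`):

```python
import pandas as pd

# toy stand-in for df_cleaned; the database labels are illustrative
df_toy = pd.DataFrame({'originalDatabase': ['PAGES2k', 'Iso2k', 'PAGES2k', 'CoralHydro2k']})

# tally records per database in one call instead of an explicit loop
counts = df_toy['originalDatabase'].value_counts()
print(counts)
```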
# plot a bar chart of the number of records per original database
fig = plt.figure(figsize=(8,4), dpi=200)
ax = plt.gca()
plt.bar(range(len(ticks)), counts, color=colours)
plt.xlabel('database')
plt.ylabel('count')
ax.set_xticks(range(len(ticks)), ticks, rotation=45, ha='right')
#ax.set_xticklabels(proxy_types, rotation=45, ha='right')
plt.title('original database')
fig.tight_layout()
plt.show()
utf.figsave(fig, 'SF_removed_recs_barchart_databases', add='%s/'%df_dupfree.name)
saved figure in /figs/dod2k_v2.0//SF_removed_recs_barchart_databases.pdf
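`utf.figsave` apparently writes the figure as a PDF under `figs/<database>/` and prints the saved path. A hypothetical sketch of such a helper (the function name, defaults, and directory layout are assumptions, not the real `ut_functions` code):

```python
import os
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this sketch runs headless
import matplotlib.pyplot as plt

def figsave_sketch(fig, name: str, add: str = '', figdir: str = 'figs') -> str:
    """Save `fig` as a PDF under figdir/add/ and return the path.
    A hypothetical stand-in for `utf.figsave`; the real helper may differ."""
    outdir = os.path.join(figdir, add)
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, name + '.pdf')
    fig.savefig(path, bbox_inches='tight')
    print('saved figure in', path)
    return path

fig_demo = plt.figure(figsize=(2, 2))
plt.plot([0, 1], [0, 1])
saved_path = figsave_sketch(fig_demo, 'demo_line', add='demo/')
```

Centralising the figure directory in one helper is what keeps all notebook figures collected under a per-database folder, as the printed paths above show.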
#%% print some info about the data
proxy_types = df_cleaned['paleoData_proxy'].unique()
archive_types = df_cleaned['archiveType'].unique()
print(proxy_types)
print(archive_types)
col = uplt.get_colours(range(0,len(archive_types)), 'Accent', -1, len(archive_types))
counts = []
ticks = []
colours = []
for ii, at in enumerate(archive_types):
    proxy_types = df_cleaned['paleoData_proxy'][df_cleaned['archiveType']==at].unique()
    for pt in proxy_types:
        cc = df_cleaned['paleoData_proxy'][(df_cleaned['paleoData_proxy']==pt)&(df_cleaned['archiveType']==at)].count()
        # print('%25s'%pt+': '+str(cc))
        counts += [cc]
        ticks += [at+': '+pt]
        colours += [col[ii]]
['ring width' 'ARSTAN' 'maximum latewood density' 'reflectance' 'd18O' 'd13C' 'Sr/Ca' 'Mg/Ca' 'temperature' 'historical' 'varve thickness' 'ice melt' 'alkenone' 'chironomid' 'Uk37' 'borehole' 'pollen' 'dinocyst' 'count' 'concentration' 'chrysophyte assemblage' 'calcification rate' 'foraminifera' 'dust' 'chloride' 'sulfate' 'nitrate' 'thickness' 'dD' 'TEX86' 'effective precipitation' 'diatom' 'multiproxy' 'humidification index' 'accumulation rate' 'sodium' 'growth rate']
['Wood' 'Coral' 'LakeSediment' 'MarineSediment' 'Documents' 'GlacierIce' 'Borehole' 'Sclerosponge' 'Speleothem' 'Other' 'GroundIce' 'MolluskShell']
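The nested archive/proxy counting loop above is equivalent to a single `groupby`, which yields one count per (archiveType, proxy) pair. A sketch on a toy frame (`df_cleaned` itself is defined earlier in the notebook):

```python
import pandas as pd

# toy stand-in for df_cleaned with a few archive/proxy combinations
df_toy = pd.DataFrame({
    'archiveType':     ['Wood', 'Wood', 'Coral', 'Coral', 'Coral'],
    'paleoData_proxy': ['ring width', 'd13C', 'd18O', 'd18O', 'Sr/Ca'],
})

# one count per (archiveType, proxy) pair, like the nested loop above
pair_counts = df_toy.groupby(['archiveType', 'paleoData_proxy']).size()
print(pair_counts)
```

The resulting Series carries a (archiveType, proxy) MultiIndex, so the bar labels used above (`at+': '+pt`) fall out of the index directly.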
# plot a bar chart of the number of proxy types included in the dataset
fig = plt.figure(figsize=(12, 6), dpi=150)
ax = plt.gca()
plt.bar(range(len(ticks)), counts, color=colours)
plt.xlabel('proxy type')
plt.ylabel('count')
ax.set_xticks(range(len(ticks)), ticks, rotation=45, ha='right')
#ax.set_xticklabels(proxy_types, rotation=45, ha='right')
ax.set_yscale('log')
plt.title('removed proxy types')
fig.tight_layout()
plt.show()
utf.figsave(fig, 'SF_removed_recs_barchart_proxytypes', add='%s/'%df_dupfree.name)
saved figure in /figs/dod2k_v2.0//SF_removed_recs_barchart_proxytypes.pdf
#%% plot the spatial distribution of the removed records
proxy_lats = df_cleaned['geo_meanLat'].values
proxy_lons = df_cleaned['geo_meanLon'].values
# plots the map
fig = plt.figure(figsize=(10, 5), dpi=200)
grid = GS(1, 3)
ax = plt.subplot(grid[:, -2:], projection=ccrs.Robinson()) # create axis with Robinson projection of globe
ax.stock_img()
ax.add_feature(cfeature.LAND) # adds land features
ax.coastlines() # adds coastline features
mt = 'ov^<>pP*XDd'*10 # string of marker symbols, repeated so there are enough for all proxy types
archive_marker = {aa: mm for aa, mm in zip(archive_types, mt)} # assign a marker type to each archive type
archive_colour = {aa: cc for aa, cc in zip(archive_types, col)} # assign a colour to each archive type
# loop through the data to generate a scatter plot of each data record:
# 1st loop: go through archive types individually (determines marker type)
# 2nd loop: through paleo proxy types attributed to the specific archive, which is colour coded
for jj, at in enumerate(archive_types):
    arch_mask = df_cleaned['archiveType']==at
    arch_proxy_types = np.unique(df_cleaned['paleoData_proxy'][arch_mask])
    for ii, pt in enumerate(arch_proxy_types):
        pt_mask = df_cleaned['paleoData_proxy']==pt
        plt.scatter(proxy_lons[pt_mask&arch_mask], proxy_lats[pt_mask&arch_mask],
                    transform=ccrs.PlateCarree(), zorder=999,
                    marker=mt[ii], color=archive_colour[at],
                    label=at+': '+pt+' ($n=%d$)'%df_cleaned['paleoData_proxy'][pt_mask&arch_mask].count(),
                    lw=.5, ec='k')
plt.title('removed proxy types')
plt.legend(bbox_to_anchor=(0.03,1.1), ncol=2, fontsize=9, framealpha=0)
grid.tight_layout(fig)
utf.figsave(fig, 'SF_removed_spatial', add='%s/'%df_dupfree.name)
/tmp/ipykernel_1169431/3409790064.py:36: UserWarning: Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations
  grid.tight_layout(fig)
saved figure in /figs/dod2k_v2.0//SF_removed_spatial.pdf