Duplicate detection - step 1: find the potential duplicates
This notebook runs the first part of the duplicate detection algorithm on a dataframe with the following columns:
- archiveType (used for duplicate detection algorithm)
- dataSetName
- datasetId
- geo_meanElev (used for duplicate detection algorithm)
- geo_meanLat (used for duplicate detection algorithm)
- geo_meanLon (used for duplicate detection algorithm)
- geo_siteName (used for duplicate detection algorithm)
- interpretation_direction
- interpretation_seasonality
- interpretation_variable
- interpretation_variableDetails
- originalDataURL
- originalDatabase
- paleoData_notes
- paleoData_proxy (used for duplicate detection algorithm)
- paleoData_units
- paleoData_values (used for duplicate detection algorithm, test for correlation, RMSE, correlation of 1st difference, RMSE of 1st difference)
- paleoData_variableName
- year (used for duplicate detection algorithm)
- yearUnits
The key function for duplicate detection is find_duplicates_optimized in dod2k_utilities/ut_duplicate_search.py (imported below as dup).
The output is saved as CSV files in the directory data/DATABASENAME/dup_detection. These files are used again in step 2 (dup_decisions.py):
- pot_dup_correlations_DATABASENAME.csv: matrix of correlations between each pair
- pot_dup_distances_km_DATABASENAME.csv: matrix of distances between each pair
- pot_dup_IDs_DATABASENAME.csv: the IDs of each pair
- pot_dup_indices_DATABASENAME.csv: the dataframe indices of each pair
Summary figures of the potential duplicate pairs are created and the plots are saved in the same directory, following: duplicatenumber_ID1_ID2_index1_index2.jpg
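As a quick orientation, the CSV file names listed above can be assembled from the database name. This is only a sketch: the helper expected_output_files is hypothetical and not part of dod2k_utilities, and the paths are assumed to be relative to the repo root.

```python
from pathlib import Path


def expected_output_files(db_name, root="data"):
    """Return the expected step-1 CSV paths for a database (hypothetical helper)."""
    out_dir = Path(root) / db_name / "dup_detection"
    # The four file kinds named in the list above.
    kinds = ["correlations", "distances_km", "IDs", "indices"]
    return {kind: out_dir / f"pot_dup_{kind}_{db_name}.csv" for kind in kinds}


# Example: list the files that step 2 would expect for the merged database.
for kind, path in expected_output_files("all_merged").items():
    print(kind, path.as_posix())
```

Checking that these files exist (for example with path.exists()) before starting step 2 is one way to confirm that step 1 finished.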
Updates:
- 06/11/2025 by LL: Tidied up and updated for DoD2k v2.0
- 27/11/2024 by LL: Fixed a bug in find_duplicates (in f_duplicate_search) and relaxed site criteria.
- 27/9/2024 created by LL
Author: Lucie J. Luecke
Set up working environment
Make sure the repo_root is set correctly. It should be your_root_dir/dod2k, and this should be the working directory throughout this notebook (and all other notebooks).
%load_ext autoreload
%autoreload 2
import sys
import os
from pathlib import Path
# Add parent directory to path (works from any notebook in notebooks/)
# the repo_root should be the parent directory of the notebooks folder
current_dir = Path().resolve()
# Determine repo root
if current_dir.name == 'dod2k': repo_root = current_dir
elif current_dir.parent.name == 'dod2k': repo_root = current_dir.parent
else: raise Exception('Please review the repo root structure (see first cell).')
# Update cwd and path only if needed
if os.getcwd() != str(repo_root):
    os.chdir(repo_root)
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
print(f"Repo root: {repo_root}")
if str(os.getcwd())==str(repo_root):
    print(f"Working directory matches repo root. ")
Repo root: /home/jupyter-lluecke/dod2k
Working directory matches repo root.
import pandas as pd
import numpy as np
from dod2k_utilities import ut_functions as utf # contains utility functions
from dod2k_utilities import ut_duplicate_search as dup # contains the duplicate search functions
Load dataset
Define the dataset which needs to be screened for duplicates. Input files for the duplicate detection mechanism need to be compact dataframes (pandas dataframes with standardised columns and entry formatting).
The function load_compact_dataframe_from_csv loads the dataframe from a csv file in data/DB/, where DB is the name of the database. The database name (db_name) can be
- pages2k
- ch2k
- iso2k
- sisal
- fe23

for the individual databases, or
- all_merged

to load the merged database of all individual databases, or it can be any user-defined compact dataframe.
# load dataframe
db_name='all_merged'
# db_name = 'dup_test'
# db_name='ch2k'
df = utf.load_compact_dataframe_from_csv(db_name)
print(df.info())
df.name = db_name
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5147 entries, 0 to 5146
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   archiveType                    5147 non-null   object
 1   dataSetName                    5147 non-null   object
 2   datasetId                      5147 non-null   object
 3   geo_meanElev                   5048 non-null   float32
 4   geo_meanLat                    5147 non-null   float32
 5   geo_meanLon                    5147 non-null   float32
 6   geo_siteName                   5147 non-null   object
 7   interpretation_direction       5147 non-null   object
 8   interpretation_seasonality     5147 non-null   object
 9   interpretation_variable        5147 non-null   object
 10  interpretation_variableDetail  5147 non-null   object
 11  originalDataURL                5147 non-null   object
 12  originalDatabase               5147 non-null   object
 13  paleoData_notes                5147 non-null   object
 14  paleoData_proxy                5147 non-null   object
 15  paleoData_sensorSpecies        5147 non-null   object
 16  paleoData_units                5147 non-null   object
 17  paleoData_values               5147 non-null   object
 18  paleoData_variableName         5147 non-null   object
 19  year                           5147 non-null   object
 20  yearUnits                      5147 non-null   object
dtypes: float32(3), object(18)
memory usage: 784.2+ KB
None
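Before running the search on a user-defined compact dataframe, it can help to confirm that the columns marked as used by the duplicate detection algorithm are present. The check below is a minimal sketch; missing_columns and REQUIRED_COLUMNS are hypothetical names and are not part of dod2k_utilities.

```python
import pandas as pd

# Columns flagged above as "used for duplicate detection algorithm".
REQUIRED_COLUMNS = [
    "archiveType", "geo_meanElev", "geo_meanLat", "geo_meanLon",
    "geo_siteName", "paleoData_proxy", "paleoData_values", "year",
]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the dataframe."""
    return [col for col in required if col not in frame.columns]


# Example with a deliberately incomplete dataframe.
toy = pd.DataFrame({"archiveType": ["tree"], "year": [[1900, 1901]]})
print(missing_columns(toy))
```

An empty list means all columns needed for the search are available.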
Duplicate Detection
Find duplicates
Now run the first part of the duplicate detection algorithm, which goes through each candidate pair and evaluates the pairs for the following criteria:
- metadata criteria:
  - archive types (archiveType) must be identical
  - proxy types (paleoData_proxy) must be identical
- geographical criteria:
  - elevation (geo_meanElev) must be similar, within a defined tolerance (use kwarg elevation_tolerance, defaults to 0)
  - latitude and longitude (geo_meanLat and geo_meanLon) must be similar, within a defined tolerance in km (use kwarg dist_tolerance_km, defaults to 8 km)
- overlap criterion:
  - time must overlap for at least $n$ points (use kwarg n_points_thresh to modify, defaults to $n=10$) unless at least one of the records is shorter than n_points_thresh
- site criterion:
  - there must be some overlap in the site name (geo_siteName)
- correlation criteria:
  - correlation over the overlapping period must be greater than a defined threshold (use corr_thresh to modify, defaults to 0.9) or correlation of the first difference must be greater than a defined threshold (use corr_diff_thresh to modify, defaults to 0.9)
  - RMSE over the overlapping period must be smaller than a defined threshold (use rmse_thresh to modify, defaults to 0.1) or RMSE of the first difference must be smaller than a defined threshold (use rmse_diff_thresh to modify, defaults to 0.1)
- URL criterion:
  - URLs (originalDataURL) must be identical if both records originate from the same database (originalDatabase must be identical)
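To make the geographical and correlation criteria concrete, the sketch below computes a great-circle distance and the four time-series metrics for one pair of records. It is not the implementation in ut_duplicate_search: the function names haversine_km and overlap_metrics are hypothetical, years are assumed to be unique within each record, and the values are compared as they are (the actual code may preprocess them differently).

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))


def overlap_metrics(year_a, values_a, year_b, values_b):
    """Correlation and RMSE on the shared years, plus the same for first differences."""
    # Indices of the years present in both records (assumes unique years).
    common, idx_a, idx_b = np.intersect1d(year_a, year_b, return_indices=True)
    a = np.asarray(values_a, dtype=float)[idx_a]
    b = np.asarray(values_b, dtype=float)[idx_b]
    da, db = np.diff(a), np.diff(b)
    return {
        "n_overlap": common.size,
        "corr": np.corrcoef(a, b)[0, 1],
        "rmse": np.sqrt(np.mean((a - b) ** 2)),
        "corr_diff": np.corrcoef(da, db)[0, 1],
        "rmse_diff": np.sqrt(np.mean((da - db) ** 2)),
    }


# Example: two synthetic records sharing the years 1910 to 1919.
years_a = np.arange(1900, 1920)
years_b = np.arange(1910, 1930)
metrics = overlap_metrics(years_a, np.sin(years_a / 3.0), years_b, np.sin(years_b / 3.0))
close_enough = haversine_km(45.0, 7.0, 45.01, 7.01) <= 8  # dist_tolerance_km default
print(metrics["n_overlap"], close_enough)
```

With the default thresholds, such a pair would pass the correlation criteria when corr or corr_diff exceeds 0.9 and rmse or rmse_diff is below 0.1. The sketch does not guard against very short overlaps, where the correlation is undefined.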
A candidate pair is flagged as a potential duplicate if all of these criteria are satisfied, or if the correlation between the candidates is particularly high (>0.98) and there is sufficient overlap (as defined by the overlap criterion).
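The flagging rule can be written as a small boolean function. This is an illustrative sketch of the rule as described in this notebook, not the code in ut_duplicate_search; the function name and its arguments are hypothetical, and each *_ok argument stands for the outcome of one of the criteria groups listed earlier.

```python
def is_potential_duplicate(metadata_ok, geo_ok, overlap_ok, site_ok,
                           corr_ok, rmse_ok, url_ok, corr,
                           high_corr_thresh=0.98):
    """Flag a pair if all criteria hold, or if the correlation is very high
    while the overlap criterion is still met."""
    all_criteria = all([metadata_ok, geo_ok, overlap_ok, site_ok,
                        corr_ok, rmse_ok, url_ok])
    very_high_corr = overlap_ok and corr > high_corr_thresh
    return all_criteria or very_high_corr


# Example: the site names do not overlap, but the correlation is 0.99 with
# sufficient overlap, so the pair is still flagged.
flag = is_potential_duplicate(True, True, True, False, True, True, True, corr=0.99)
print(flag)
```

The second branch is what lets near-identical records through even when metadata such as the site name differs between databases.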
The output for a database named DB is saved under data/DB/dup_detection/dup_detection_candidates_DB.csv.
## run the find duplicate algorithm
dup.find_duplicates_optimized(df, n_points_thresh=10)
all_merged
Start duplicate search:
=================================
checking parameters:
proxy archive : must match
proxy type : must match
distance (km) < 8
elevation : must match
time overlap > 10
correlation > 0.9
RMSE < 0.1
1st difference rmse < 0.1
correlation of 1st difference > 0.9
=================================
Start duplicate search
Progress: 0/5147
--> Found potential duplicate: 0: pages2k_0&4235: iso2k_296 (n_potential_duplicates=1)
--> Found potential duplicate: 0: pages2k_0&4236: iso2k_298 (n_potential_duplicates=2)
--> Found potential duplicate: 0: pages2k_0&4237: iso2k_299 (n_potential_duplicates=3)
--> Found potential duplicate: 2: pages2k_6&2864: fe23_northamerica_usa_az555 (n_potential_duplicates=4)
Progress: 10/5147
--> Found potential duplicate: 14: pages2k_50&1414: fe23_northamerica_canada_cana091 (n_potential_duplicates=5)
--> Found potential duplicate: 16: pages2k_62&17: pages2k_63 (n_potential_duplicates=6)
Progress: 20/5147
--> Found potential duplicate: 24: pages2k_81&3973: ch2k_he08lra01_76 (n_potential_duplicates=7)
--> Found potential duplicate: 24: pages2k_81&4563: iso2k_1813 (n_potential_duplicates=8)
--> Found potential duplicate: 25: pages2k_83&4594: iso2k_1916 (n_potential_duplicates=9)
--> Found potential duplicate: 26: pages2k_85&27: pages2k_88 (n_potential_duplicates=10)
--> Found potential duplicate: 29: pages2k_94&1466: fe23_northamerica_canada_cana153 (n_potential_duplicates=11)
Progress: 30/5147
--> Found potential duplicate: 32: pages2k_107&2636: fe23_northamerica_usa_ak046 (n_potential_duplicates=12)
--> Found potential duplicate: 37: pages2k_121&38: pages2k_122 (n_potential_duplicates=13)
Progress: 40/5147
--> Found potential duplicate: 43: pages2k_132&1533: fe23_northamerica_canada_cana225 (n_potential_duplicates=14)
--> Found potential duplicate: 49: pages2k_158&3845: fe23_northamerica_usa_wa069 (n_potential_duplicates=15)
Progress: 50/5147
--> Found potential duplicate: 52: pages2k_171&3929: fe23_northamerica_usa_wy021 (n_potential_duplicates=16)
Progress: 60/5147
--> Found potential duplicate: 60: pages2k_203&4353: iso2k_826 (n_potential_duplicates=17)
Progress: 70/5147
--> Found potential duplicate: 70: pages2k_225&3526: fe23_northamerica_usa_nv512 (n_potential_duplicates=18)
--> Found potential duplicate: 73: pages2k_238&4403: iso2k_1044 (n_potential_duplicates=19)
--> Found potential duplicate: 75: pages2k_242&4155: ch2k_li06fij01_582 (n_potential_duplicates=20)
--> Found potential duplicate: 75: pages2k_242&4250: iso2k_353 (n_potential_duplicates=21)
Progress: 80/5147
--> Found potential duplicate: 81: pages2k_258&4487: iso2k_1498 (n_potential_duplicates=22)
--> Found potential duplicate: 84: pages2k_263&4456: iso2k_1322 (n_potential_duplicates=23)
--> Found potential duplicate: 86: pages2k_267&4179: iso2k_58 (n_potential_duplicates=24)
--> Found potential duplicate: 86: pages2k_267&4408: iso2k_1068 (n_potential_duplicates=25)
--> Found potential duplicate: 88: pages2k_271&4133: ch2k_fe18rus01_492 (n_potential_duplicates=26)
--> Found potential duplicate: 88: pages2k_271&4580: iso2k_1861 (n_potential_duplicates=27)
--> Found potential duplicate: 89: pages2k_273&2496: fe23_asia_russ130w (n_potential_duplicates=28)
Progress: 90/5147
--> Found potential duplicate: 92: pages2k_281&1468: fe23_northamerica_canada_cana155 (n_potential_duplicates=29)
--> Found potential duplicate: 95: pages2k_294&2611: fe23_northamerica_usa_ak021 (n_potential_duplicates=30)
--> Found potential duplicate: 98: pages2k_305&100: pages2k_309 (n_potential_duplicates=31)
--> Found potential duplicate: 99: pages2k_307&101: pages2k_311 (n_potential_duplicates=32)
Progress: 100/5147
--> Found potential duplicate: 103: pages2k_315&4252: iso2k_362 (n_potential_duplicates=33)
--> Found potential duplicate: 104: pages2k_317&3975: ch2k_na09mal01_84 (n_potential_duplicates=34)
--> Found potential duplicate: 104: pages2k_317&4549: iso2k_1754 (n_potential_duplicates=35)
--> Found potential duplicate: 106: pages2k_323&1518: fe23_northamerica_canada_cana210 (n_potential_duplicates=36)
Progress: 110/5147
Progress: 120/5147
--> Found potential duplicate: 123: pages2k_385&4063: ch2k_fe09oga01_304 (n_potential_duplicates=37)
--> Found potential duplicate: 123: pages2k_385&4596: iso2k_1922 (n_potential_duplicates=38)
--> Found potential duplicate: 124: pages2k_387&4064: ch2k_fe09oga01_306 (n_potential_duplicates=39)
--> Found potential duplicate: 129: pages2k_395&4098: ch2k_ca07fli01_400 (n_potential_duplicates=40)
--> Found potential duplicate: 129: pages2k_395&4406: iso2k_1057 (n_potential_duplicates=41)
Progress: 130/5147
--> Found potential duplicate: 130: pages2k_397&4099: ch2k_ca07fli01_402 (n_potential_duplicates=42)
--> Found potential duplicate: 135: pages2k_409&4107: ch2k_qu96esv01_422 (n_potential_duplicates=43)
--> Found potential duplicate: 135: pages2k_409&4213: iso2k_218 (n_potential_duplicates=44)
--> Found potential duplicate: 137: pages2k_414&139: pages2k_418 (n_potential_duplicates=45)
--> Found potential duplicate: 138: pages2k_417&140: pages2k_421 (n_potential_duplicates=46)
Progress: 140/5147
--> Found potential duplicate: 146: pages2k_427&152: pages2k_433 (n_potential_duplicates=47)
Progress: 150/5147
--> Found potential duplicate: 154: pages2k_435&287: pages2k_842 (n_potential_duplicates=48)
--> Found potential duplicate: 157: pages2k_444&158: pages2k_445 (n_potential_duplicates=49)
--> Found potential duplicate: 157: pages2k_444&159: pages2k_446 (n_potential_duplicates=50)
--> Found potential duplicate: 158: pages2k_445&159: pages2k_446 (n_potential_duplicates=51)
Progress: 160/5147
--> Found potential duplicate: 164: pages2k_462&4035: ch2k_os14ucp01_236 (n_potential_duplicates=52)
--> Found potential duplicate: 164: pages2k_462&4249: iso2k_350 (n_potential_duplicates=53)
--> Found potential duplicate: 167: pages2k_468&1145: pages2k_3550 (n_potential_duplicates=54)
--> Found potential duplicate: 167: pages2k_468&2503: fe23_asia_russ137w (n_potential_duplicates=55)
--> Found potential duplicate: 169: pages2k_472&170: pages2k_474 (n_potential_duplicates=56)
--> Found potential duplicate: 169: pages2k_472&172: pages2k_477 (n_potential_duplicates=57)
Progress: 170/5147
--> Found potential duplicate: 170: pages2k_474&172: pages2k_477 (n_potential_duplicates=58)
--> Found potential duplicate: 173: pages2k_478&4571: iso2k_1846 (n_potential_duplicates=59)
--> Found potential duplicate: 176: pages2k_486&2984: fe23_northamerica_usa_ca609 (n_potential_duplicates=60)
--> Found potential duplicate: 178: pages2k_495&3950: ch2k_li06rar01_12 (n_potential_duplicates=61)
--> Found potential duplicate: 178: pages2k_495&4489: iso2k_1502 (n_potential_duplicates=62)
Progress: 180/5147
--> Found potential duplicate: 181: pages2k_500&4062: ch2k_as05gua01_302 (n_potential_duplicates=63)
--> Found potential duplicate: 181: pages2k_500&4502: iso2k_1559 (n_potential_duplicates=64)
Progress: 190/5147
--> Found potential duplicate: 193: pages2k_541&4258: iso2k_404 (n_potential_duplicates=65)
--> Found potential duplicate: 194: pages2k_543&330: pages2k_976 (n_potential_duplicates=66)
Progress: 200/5147
--> Found potential duplicate: 200: pages2k_565&4395: iso2k_998 (n_potential_duplicates=67)
--> Found potential duplicate: 209: pages2k_583&3377: fe23_northamerica_usa_mt116 (n_potential_duplicates=68)
Progress: 210/5147
--> Found potential duplicate: 211: pages2k_592&4048: ch2k_li06rar02_270 (n_potential_duplicates=69)
--> Found potential duplicate: 211: pages2k_592&4488: iso2k_1500 (n_potential_duplicates=70)
--> Found potential duplicate: 217: pages2k_610&4428: iso2k_1199 (n_potential_duplicates=71)
Progress: 220/5147
--> Found potential duplicate: 224: pages2k_626&3847: fe23_northamerica_usa_wa071 (n_potential_duplicates=72)
Progress: 230/5147
Progress: 240/5147
--> Found potential duplicate: 243: pages2k_691&1385: fe23_northamerica_canada_cana062 (n_potential_duplicates=73)
Progress: 250/5147
--> Found potential duplicate: 253: pages2k_730&4256: iso2k_396 (n_potential_duplicates=74)
--> Found potential duplicate: 255: pages2k_736&3932: fe23_northamerica_usa_wy024 (n_potential_duplicates=75)
Progress: 260/5147
Progress: 270/5147
--> Found potential duplicate: 270: pages2k_800&1542: fe23_northamerica_canada_cana234 (n_potential_duplicates=76)
--> Found potential duplicate: 274: pages2k_818&4278: iso2k_488 (n_potential_duplicates=77)
--> Found potential duplicate: 279: pages2k_827&281: pages2k_830 (n_potential_duplicates=78)
Progress: 280/5147
--> Found potential duplicate: 282: pages2k_831&709: pages2k_2220 (n_potential_duplicates=79)
--> Found potential duplicate: 282: pages2k_831&2493: fe23_asia_russ127w (n_potential_duplicates=80)
Progress: 290/5147
--> Found potential duplicate: 293: pages2k_857&3736: fe23_northamerica_usa_ut511 (n_potential_duplicates=81)
--> Found potential duplicate: 299: pages2k_881&4397: iso2k_1010 (n_potential_duplicates=82)
Progress: 300/5147
--> Found potential duplicate: 302: pages2k_893&303: pages2k_895 (n_potential_duplicates=83)
--> Found potential duplicate: 302: pages2k_893&305: pages2k_900 (n_potential_duplicates=84)
--> Found potential duplicate: 303: pages2k_895&305: pages2k_900 (n_potential_duplicates=85)
Progress: 310/5147
--> Found potential duplicate: 316: pages2k_940&4046: ch2k_dr99abr01_264 (n_potential_duplicates=86)
--> Found potential duplicate: 316: pages2k_940&4047: ch2k_dr99abr01_266 (n_potential_duplicates=87)
--> Found potential duplicate: 316: pages2k_940&4188: iso2k_91 (n_potential_duplicates=88)
--> Found potential duplicate: 319: pages2k_945&4191: iso2k_100 (n_potential_duplicates=89)
Progress: 320/5147
--> Found potential duplicate: 323: pages2k_960&4321: iso2k_641 (n_potential_duplicates=90)
Progress: 330/5147
--> Found potential duplicate: 332: pages2k_982&3594: fe23_northamerica_usa_or042 (n_potential_duplicates=91)
--> Found potential duplicate: 337: pages2k_1004&4322: iso2k_644 (n_potential_duplicates=92)
Progress: 340/5147
--> Found potential duplicate: 343: pages2k_1026&2862: fe23_northamerica_usa_az553 (n_potential_duplicates=93)
--> Found potential duplicate: 348: pages2k_1048&4432: iso2k_1212 (n_potential_duplicates=94)
Progress: 350/5147
--> Found potential duplicate: 357: pages2k_1089&3374: fe23_northamerica_usa_mt112 (n_potential_duplicates=95)
--> Found potential duplicate: 357: pages2k_1089&3375: fe23_northamerica_usa_mt113 (n_potential_duplicates=96)
Progress: 360/5147
--> Found potential duplicate: 364: pages2k_1108&4407: iso2k_1060 (n_potential_duplicates=97)
--> Found potential duplicate: 367: pages2k_1116&1478: fe23_northamerica_canada_cana170w (n_potential_duplicates=98)
Progress: 370/5147
--> Found potential duplicate: 375: pages2k_1147&3974: ch2k_da06maf01_78 (n_potential_duplicates=99)
--> Found potential duplicate: 375: pages2k_1147&3980: ch2k_da06maf02_104 (n_potential_duplicates=100)
--> Found potential duplicate: 375: pages2k_1147&4546: iso2k_1748 (n_potential_duplicates=101)
--> Found potential duplicate: 378: pages2k_1153&379: pages2k_1156 (n_potential_duplicates=102)
--> Found potential duplicate: 378: pages2k_1153&381: pages2k_1160 (n_potential_duplicates=103)
--> Found potential duplicate: 379: pages2k_1156&381: pages2k_1160 (n_potential_duplicates=104)
Progress: 380/5147
Progress: 390/5147
--> Found potential duplicate: 398: pages2k_1209&3102: fe23_northamerica_usa_co553 (n_potential_duplicates=105)
Progress: 400/5147
--> Found potential duplicate: 409: pages2k_1252&1419: fe23_northamerica_canada_cana096 (n_potential_duplicates=106)
Progress: 410/5147
--> Found potential duplicate: 414: pages2k_1274&4509: iso2k_1577 (n_potential_duplicates=107)
Progress: 420/5147
--> Found potential duplicate: 420: pages2k_1293&4351: iso2k_821 (n_potential_duplicates=108)
--> Found potential duplicate: 428: pages2k_1325&3938: fe23_northamerica_usa_wy030 (n_potential_duplicates=109)
Progress: 430/5147
--> Found potential duplicate: 436: pages2k_1360&3953: ch2k_ur00mai01_22 (n_potential_duplicates=110)
--> Found potential duplicate: 436: pages2k_1360&4189: iso2k_94 (n_potential_duplicates=111)
--> Found potential duplicate: 436: pages2k_1360&4190: iso2k_98 (n_potential_duplicates=112)
--> Found potential duplicate: 437: pages2k_1362&438: pages2k_1365 (n_potential_duplicates=113)
--> Found potential duplicate: 439: pages2k_1370&4516: iso2k_1619 (n_potential_duplicates=114)
Progress: 440/5147
Progress: 450/5147
--> Found potential duplicate: 451: pages2k_1420&1435: fe23_northamerica_canada_cana111 (n_potential_duplicates=115)
--> Found potential duplicate: 456: pages2k_1442&457: pages2k_1444 (n_potential_duplicates=116)
Progress: 460/5147
--> Found potential duplicate: 469: pages2k_1488&517: pages2k_1628 (n_potential_duplicates=117)
--> Found potential duplicate: 469: pages2k_1488&3965: ch2k_nu11pal01_52 (n_potential_duplicates=118)
--> Found potential duplicate: 469: pages2k_1488&4283: iso2k_505 (n_potential_duplicates=119)
--> Found potential duplicate: 469: pages2k_1488&4309: iso2k_579 (n_potential_duplicates=120)
Progress: 470/5147
--> Found potential duplicate: 470: pages2k_1490&3966: ch2k_nu11pal01_54 (n_potential_duplicates=121)
--> Found potential duplicate: 471: pages2k_1491&4308: iso2k_575 (n_potential_duplicates=122)
--> Found potential duplicate: 474: pages2k_1497&4588: iso2k_1885 (n_potential_duplicates=123)
--> Found potential duplicate: 477: pages2k_1515&479: pages2k_1519 (n_potential_duplicates=124)
Progress: 480/5147
--> Found potential duplicate: 480: pages2k_1520&481: pages2k_1522 (n_potential_duplicates=125)
Progress: 490/5147
--> Found potential duplicate: 490: pages2k_1547&4223: iso2k_259 (n_potential_duplicates=126)
--> Found potential duplicate: 499: pages2k_1566&1539: fe23_northamerica_canada_cana231 (n_potential_duplicates=127)
Progress: 500/5147
--> Found potential duplicate: 508: pages2k_1605&2981: fe23_northamerica_usa_ca606 (n_potential_duplicates=128)
Progress: 510/5147
--> Found potential duplicate: 512: pages2k_1619&514: pages2k_1623 (n_potential_duplicates=129)
--> Found potential duplicate: 517: pages2k_1628&3965: ch2k_nu11pal01_52 (n_potential_duplicates=130)
--> Found potential duplicate: 517: pages2k_1628&4283: iso2k_505 (n_potential_duplicates=131)
--> Found potential duplicate: 517: pages2k_1628&4309: iso2k_579 (n_potential_duplicates=132)
--> Found potential duplicate: 519: pages2k_1636&3857: fe23_northamerica_usa_wa081 (n_potential_duplicates=133)
Progress: 520/5147
Progress: 530/5147
--> Found potential duplicate: 533: pages2k_1686&534: pages2k_1688 (n_potential_duplicates=134)
--> Found potential duplicate: 536: pages2k_1692&2248: fe23_asia_mong012 (n_potential_duplicates=135)
--> Found potential duplicate: 538: pages2k_1703&4031: ch2k_mo06ped01_226 (n_potential_duplicates=136)
--> Found potential duplicate: 538: pages2k_1703&4317: iso2k_629 (n_potential_duplicates=137)
Progress: 540/5147
--> Found potential duplicate: 543: pages2k_1712&4331: iso2k_715 (n_potential_duplicates=138)
--> Found potential duplicate: 547: pages2k_1720&4510: iso2k_1579 (n_potential_duplicates=139)
Progress: 550/5147
--> Found potential duplicate: 553: pages2k_1741&3880: fe23_northamerica_usa_wa104 (n_potential_duplicates=140)
--> Found potential duplicate: 555: pages2k_1750&4579: iso2k_1856 (n_potential_duplicates=141)
--> Found potential duplicate: 555: pages2k_1750&4795: sisal_294.0_194 (n_potential_duplicates=142)
Progress: 560/5147
--> Found potential duplicate: 561: pages2k_1771&4017: ch2k_tu01lai01_192 (n_potential_duplicates=143)
Progress: 570/5147
--> Found potential duplicate: 572: pages2k_1804&3268: fe23_northamerica_usa_me010 (n_potential_duplicates=144)
Progress: 580/5147
--> Found potential duplicate: 585: pages2k_1859&4039: ch2k_he10gua01_244 (n_potential_duplicates=145)
--> Found potential duplicate: 585: pages2k_1859&4542: iso2k_1735 (n_potential_duplicates=146)
--> Found potential duplicate: 586: pages2k_1861&4040: ch2k_he10gua01_246 (n_potential_duplicates=147)
Progress: 590/5147
--> Found potential duplicate: 591: pages2k_1880&2650: fe23_northamerica_usa_ak060 (n_potential_duplicates=148)
--> Found potential duplicate: 594: pages2k_1891&595: pages2k_1893 (n_potential_duplicates=149)
Progress: 600/5147
--> Found potential duplicate: 604: pages2k_1918&4192: iso2k_102 (n_potential_duplicates=150)
--> Found potential duplicate: 605: pages2k_1920&606: pages2k_1923 (n_potential_duplicates=151)
--> Found potential duplicate: 609: pages2k_1932&610: pages2k_1934 (n_potential_duplicates=152)
Progress: 610/5147
--> Found potential duplicate: 614: pages2k_1942&3955: ch2k_zi04ifr01_26 (n_potential_duplicates=153)
--> Found potential duplicate: 614: pages2k_1942&4222: iso2k_257 (n_potential_duplicates=154)
Progress: 620/5147
--> Found potential duplicate: 624: pages2k_1972&625: pages2k_1973 (n_potential_duplicates=155)
--> Found potential duplicate: 626: pages2k_1976&628: pages2k_1980 (n_potential_duplicates=156)
--> Found potential duplicate: 627: pages2k_1978&629: pages2k_1983 (n_potential_duplicates=157)
Progress: 630/5147
--> Found potential duplicate: 630: pages2k_1985&4449: iso2k_1294 (n_potential_duplicates=158)
--> Found potential duplicate: 632: pages2k_1989&633: pages2k_1991 (n_potential_duplicates=159)
--> Found potential duplicate: 634: pages2k_1994&4043: ch2k_de12anc01_258 (n_potential_duplicates=160)
--> Found potential duplicate: 639: pages2k_2013&1420: fe23_northamerica_canada_cana097 (n_potential_duplicates=161)
Progress: 640/5147
--> Found potential duplicate: 647: pages2k_2042&3954: ch2k_tu95mad01_24 (n_potential_duplicates=162)
--> Found potential duplicate: 647: pages2k_2042&4169: iso2k_20 (n_potential_duplicates=163)
Progress: 650/5147
--> Found potential duplicate: 655: pages2k_2059&2648: fe23_northamerica_usa_ak058 (n_potential_duplicates=164)
Progress: 660/5147
--> Found potential duplicate: 661: pages2k_2085&1344: fe23_northamerica_canada_cana002 (n_potential_duplicates=165)
--> Found potential duplicate: 663: pages2k_2094&4117: ch2k_tu01dep01_450 (n_potential_duplicates=166)
--> Found potential duplicate: 663: pages2k_2094&4429: iso2k_1201 (n_potential_duplicates=167)
--> Found potential duplicate: 665: pages2k_2098&667: pages2k_2103 (n_potential_duplicates=168)
Progress: 670/5147
--> Found potential duplicate: 670: pages2k_2110&3103: fe23_northamerica_usa_co554 (n_potential_duplicates=169)
Progress: 680/5147
--> Found potential duplicate: 682: pages2k_2146&684: pages2k_2149 (n_potential_duplicates=170)
--> Found potential duplicate: 682: pages2k_2146&685: pages2k_2150 (n_potential_duplicates=171)
--> Found potential duplicate: 684: pages2k_2149&685: pages2k_2150 (n_potential_duplicates=172)
--> Found potential duplicate: 688: pages2k_2156&1477: fe23_northamerica_canada_cana169w (n_potential_duplicates=173)
Progress: 690/5147
Progress: 700/5147
--> Found potential duplicate: 704: pages2k_2214&4519: iso2k_1631 (n_potential_duplicates=174)
--> Found potential duplicate: 709: pages2k_2220&2493: fe23_asia_russ127w (n_potential_duplicates=175)
Progress: 710/5147
--> Found potential duplicate: 712: pages2k_2226&2243: fe23_asia_mong007w (n_potential_duplicates=176)
Progress: 720/5147
--> Found potential duplicate: 728: pages2k_2265&2660: fe23_northamerica_usa_ak070 (n_potential_duplicates=177)
Progress: 730/5147
--> Found potential duplicate: 736: pages2k_2287&737: pages2k_2290 (n_potential_duplicates=178)
Progress: 740/5147
--> Found potential duplicate: 742: pages2k_2300&4010: ch2k_os14rip01_174 (n_potential_duplicates=179)
--> Found potential duplicate: 744: pages2k_2303&2242: fe23_asia_mong006 (n_potential_duplicates=180)
--> Found potential duplicate: 747: pages2k_2309&4024: ch2k_we09arr01_208 (n_potential_duplicates=181)
--> Found potential duplicate: 748: pages2k_2311&4025: ch2k_we09arr01_210 (n_potential_duplicates=182)
Progress: 750/5147
--> Found potential duplicate: 752: pages2k_2319&2704: fe23_northamerica_usa_ak6 (n_potential_duplicates=183)
--> Found potential duplicate: 755: pages2k_2339&757: pages2k_2344 (n_potential_duplicates=184)
Progress: 760/5147
--> Found potential duplicate: 763: pages2k_2361&3873: fe23_northamerica_usa_wa097 (n_potential_duplicates=185)
Progress: 770/5147
--> Found potential duplicate: 774: pages2k_2402&3133: fe23_northamerica_usa_co586 (n_potential_duplicates=186)
Progress: 780/5147
--> Found potential duplicate: 781: pages2k_2430&1437: fe23_northamerica_canada_cana113 (n_potential_duplicates=187)
Progress: 790/5147
--> Found potential duplicate: 792: pages2k_2473&3930: fe23_northamerica_usa_wy022 (n_potential_duplicates=188)
--> Found potential duplicate: 799: pages2k_2500&800: pages2k_2502 (n_potential_duplicates=189)
Progress: 800/5147
--> Found potential duplicate: 804: pages2k_2510&4517: iso2k_1626 (n_potential_duplicates=190)
--> Found potential duplicate: 806: pages2k_2514&4479: iso2k_1467 (n_potential_duplicates=191)
--> Found potential duplicate: 808: pages2k_2517&4420: iso2k_1130 (n_potential_duplicates=192)
Progress: 810/5147
--> Found potential duplicate: 813: pages2k_2534&4508: iso2k_1575 (n_potential_duplicates=193)
--> Found potential duplicate: 815: pages2k_2538&4581: iso2k_1862 (n_potential_duplicates=194)
Progress: 820/5147
--> Found potential duplicate: 822: pages2k_2561&1417: fe23_northamerica_canada_cana094 (n_potential_duplicates=195)
--> Found potential duplicate: 828: pages2k_2592&830: pages2k_2596 (n_potential_duplicates=196)
--> Found potential duplicate: 829: pages2k_2595&831: pages2k_2599 (n_potential_duplicates=197)
Progress: 830/5147
--> Found potential duplicate: 834: pages2k_2604&835: pages2k_2606 (n_potential_duplicates=198)
--> Found potential duplicate: 834: pages2k_2604&4484: iso2k_1481 (n_potential_duplicates=199)
--> Found potential duplicate: 835: pages2k_2606&4484: iso2k_1481 (n_potential_duplicates=200)
--> Found potential duplicate: 836: pages2k_2607&837: pages2k_2609 (n_potential_duplicates=201)
--> Found potential duplicate: 836: pages2k_2607&839: pages2k_2612 (n_potential_duplicates=202)
--> Found potential duplicate: 837: pages2k_2609&839: pages2k_2612 (n_potential_duplicates=203)
Progress: 840/5147
--> Found potential duplicate: 840: pages2k_2613&4480: iso2k_1470 (n_potential_duplicates=204)
--> Found potential duplicate: 842: pages2k_2617&4507: iso2k_1573 (n_potential_duplicates=205)
Progress: 850/5147
--> Found potential duplicate: 850: pages2k_2634&3229: fe23_northamerica_usa_id013 (n_potential_duplicates=206)
--> Found potential duplicate: 856: pages2k_2660&2604: fe23_northamerica_usa_ak014 (n_potential_duplicates=207)
Progress: 860/5147
--> Found potential duplicate: 861: pages2k_2677&3931: fe23_northamerica_usa_wy023 (n_potential_duplicates=208)
--> Found potential duplicate: 867: pages2k_2703&2683: fe23_northamerica_usa_ak094 (n_potential_duplicates=209)
Progress: 870/5147
--> Found potential duplicate: 873: pages2k_2722&1546: fe23_northamerica_canada_cana238 (n_potential_duplicates=210)
Progress: 880/5147
--> Found potential duplicate: 881: pages2k_2750&4534: iso2k_1708 (n_potential_duplicates=211)
--> Found potential duplicate: 882: pages2k_2752&883: pages2k_2755 (n_potential_duplicates=212)
--> Found potential duplicate: 882: pages2k_2752&885: pages2k_2759 (n_potential_duplicates=213)
--> Found potential duplicate: 883: pages2k_2755&885: pages2k_2759 (n_potential_duplicates=214)
Progress: 890/5147
Progress: 900/5147
--> Found potential duplicate: 900: pages2k_2793&901: pages2k_2795 (n_potential_duplicates=215)
--> Found potential duplicate: 901: pages2k_2795&903: pages2k_2798 (n_potential_duplicates=216)
--> Found potential duplicate: 902: pages2k_2796&903: pages2k_2798 (n_potential_duplicates=217)
Progress: 910/5147
--> Found potential duplicate: 913: pages2k_2830&2213: fe23_northamerica_mexico_mexi020 (n_potential_duplicates=218)
--> Found potential duplicate: 916: pages2k_2843&3859: fe23_northamerica_usa_wa083 (n_potential_duplicates=219)
Progress: 920/5147
Progress: 930/5147
--> Found potential duplicate: 931: pages2k_2899&932: pages2k_2901 (n_potential_duplicates=220)
--> Found potential duplicate: 933: pages2k_2904&934: pages2k_2906 (n_potential_duplicates=221)
Progress: 940/5147
--> Found potential duplicate: 940: pages2k_2922&2978: fe23_northamerica_usa_ca603 (n_potential_duplicates=222)
--> Found potential duplicate: 949: pages2k_2953&4307: iso2k_573 (n_potential_duplicates=223)
Progress: 950/5147
--> Found potential duplicate: 951: pages2k_2959&2233: fe23_northamerica_mexico_mexi043 (n_potential_duplicates=224)
--> Found potential duplicate: 956: pages2k_2976&3224: fe23_northamerica_usa_id008 (n_potential_duplicates=225)
Progress: 960/5147
--> Found potential duplicate: 962: pages2k_3002&3595: fe23_northamerica_usa_or043 (n_potential_duplicates=226)
--> Found potential duplicate: 969: pages2k_3028&970: pages2k_3030 (n_potential_duplicates=227)
--> Found potential duplicate: 969: pages2k_3028&972: pages2k_3033 (n_potential_duplicates=228)
Progress: 970/5147
--> Found potential duplicate: 970: pages2k_3030&972: pages2k_3033 (n_potential_duplicates=229)
--> Found potential duplicate: 974: pages2k_3038&3370: fe23_northamerica_usa_mt108 (n_potential_duplicates=230)
Progress: 980/5147
--> Found potential duplicate: 981: pages2k_3064&4327: iso2k_698 (n_potential_duplicates=231)
--> Found potential duplicate: 982: pages2k_3068&4143: ch2k_zi14ifr02_522 (n_potential_duplicates=232)
--> Found potential duplicate: 982: pages2k_3068&4144: ch2k_zi14ifr02_524 (n_potential_duplicates=233)
--> Found potential duplicate: 987: pages2k_3085&4000: ch2k_ku00nin01_150 (n_potential_duplicates=234)
--> Found potential duplicate: 987: pages2k_3085&4499: iso2k_1554 (n_potential_duplicates=235)
--> Found potential duplicate: 987: pages2k_3085&4500: iso2k_1556 (n_potential_duplicates=236)
Progress: 990/5147
--> Found potential duplicate: 994: pages2k_3107&3101: fe23_northamerica_usa_co552 (n_potential_duplicates=237)
--> Found potential duplicate: 995: pages2k_3108&3101: fe23_northamerica_usa_co552 (n_potential_duplicates=238)
Progress: 1000/5147
--> Found potential duplicate: 1001: pages2k_3132&3997: ch2k_qu06rab01_144 (n_potential_duplicates=239)
--> Found potential duplicate: 1001: pages2k_3132&4453: iso2k_1311 (n_potential_duplicates=240)
--> Found potential duplicate: 1002: pages2k_3134&3998: ch2k_qu06rab01_146 (n_potential_duplicates=241)
Progress: 1010/5147
--> Found potential duplicate: 1014: pages2k_3170&2351: fe23_australia_newz062 (n_potential_duplicates=242)
--> Found potential duplicate: 1017: pages2k_3179&2647: fe23_northamerica_usa_ak057 (n_potential_duplicates=243)
--> Found potential duplicate: 1019: pages2k_3188&1020: pages2k_3191 (n_potential_duplicates=244)
Progress: 1020/5147
--> Found potential duplicate: 1021: pages2k_3196&2247: fe23_asia_mong011 (n_potential_duplicates=245)
--> Found potential duplicate: 1024: pages2k_3202&4539: iso2k_1727 (n_potential_duplicates=246)
Progress: 1030/5147
--> Found potential duplicate: 1033: pages2k_3234&1034: pages2k_3236 (n_potential_duplicates=247)
--> Found potential duplicate: 1033: pages2k_3234&1036: pages2k_3239 (n_potential_duplicates=248)
--> Found potential duplicate: 1034: pages2k_3236&1036: pages2k_3239 (n_potential_duplicates=249)
--> Found potential duplicate: 1039: pages2k_3243&4166: iso2k_0 (n_potential_duplicates=250)
Progress: 1040/5147
--> Found potential duplicate: 1046: pages2k_3263&4440: iso2k_1264 (n_potential_duplicates=251)
--> Found potential duplicate: 1048: pages2k_3266&4096: ch2k_go12sbv01_396 (n_potential_duplicates=252)
--> Found potential duplicate: 1048: pages2k_3266&4365: iso2k_870 (n_potential_duplicates=253)
Progress: 1050/5147
Progress: 1060/5147
--> Found potential duplicate: 1063: pages2k_3307&4244: iso2k_339 (n_potential_duplicates=254)
--> Found potential duplicate: 1065: pages2k_3313&2936: fe23_northamerica_usa_ca560 (n_potential_duplicates=255)
Progress: 1070/5147
--> Found potential duplicate: 1071: pages2k_3337&1073: pages2k_3342 (n_potential_duplicates=256)
--> Found potential duplicate: 1077: pages2k_3352&4128: ch2k_zi14tur01_480 (n_potential_duplicates=257)
--> Found potential duplicate: 1077: pages2k_3352&4129: ch2k_zi14tur01_482 (n_potential_duplicates=258)
--> Found potential duplicate: 1077: pages2k_3352&4239: iso2k_302 (n_potential_duplicates=259)
Progress: 1080/5147
--> Found potential duplicate: 1087: pages2k_3372&4087: ch2k_ki04mcv01_366 (n_potential_duplicates=260)
--> Found potential duplicate: 1087: pages2k_3372&4203: iso2k_155 (n_potential_duplicates=261)
--> Found potential duplicate: 1088: pages2k_3374&4088: ch2k_ki04mcv01_368 (n_potential_duplicates=262)
Progress: 1090/5147
--> Found potential duplicate: 1099: pages2k_3404&1355: fe23_northamerica_canada_cana029 (n_potential_duplicates=263)
Progress: 1100/5147
--> Found potential duplicate: 1103: pages2k_3417&1104: pages2k_3419 (n_potential_duplicates=264)
Progress: 1110/5147
Progress: 1120/5147
Progress: 1130/5147
--> Found potential duplicate: 1131: pages2k_3503&3848: fe23_northamerica_usa_wa072 (n_potential_duplicates=265)
--> Found potential duplicate: 1138: pages2k_3524&2600: fe23_northamerica_usa_ak010 (n_potential_duplicates=266)
Progress: 1140/5147
--> Found potential duplicate: 1145: pages2k_3550&2503: fe23_asia_russ137w (n_potential_duplicates=267)
--> Found potential duplicate: 1146: pages2k_3552&4511: iso2k_1581 (n_potential_duplicates=268)
--> Found potential duplicate: 1147: pages2k_3554&4112: ch2k_li94sec01_436 (n_potential_duplicates=269)
--> Found potential duplicate: 1147: pages2k_3554&4419: iso2k_1124 (n_potential_duplicates=270)
Progress: 1150/5147
--> Found potential duplicate: 1152: pages2k_3571&4204: iso2k_174 (n_potential_duplicates=271)
--> Found potential duplicate: 1156: pages2k_3583&3180: fe23_northamerica_usa_co633 (n_potential_duplicates=272)
Progress: 1160/5147
--> Found
potential duplicate: 1161: pages2k_3599&4409: iso2k_1069 (n_potential_duplicates=273) --> Found potential duplicate: 1161: pages2k_3599&4528: iso2k_1660 (n_potential_duplicates=274) --> Found potential duplicate: 1166: pages2k_3609&1376: fe23_northamerica_canada_cana053 (n_potential_duplicates=275) Progress: 1170/5147 --> Found potential duplicate: 1171: pages2k_3631&4495: iso2k_1530 (n_potential_duplicates=276) --> Found potential duplicate: 1175: pages2k_3642&3933: fe23_northamerica_usa_wy025 (n_potential_duplicates=277) Progress: 1180/5147 Progress: 1190/5147 Progress: 1200/5147 Progress: 1210/5147 --> Found potential duplicate: 1218: fe23_southamerica_arge016&1287: fe23_southamerica_arge085 (n_potential_duplicates=278) Progress: 1220/5147 Progress: 1230/5147 Progress: 1240/5147 Progress: 1250/5147 Progress: 1260/5147 Progress: 1270/5147 Progress: 1280/5147 Progress: 1290/5147 Progress: 1300/5147 Progress: 1310/5147 Progress: 1320/5147 Progress: 1330/5147 Progress: 1340/5147 Progress: 1350/5147 Progress: 1360/5147 Progress: 1370/5147 Progress: 1380/5147 Progress: 1390/5147 Progress: 1400/5147 Progress: 1410/5147 Progress: 1420/5147 --> Found potential duplicate: 1425: fe23_northamerica_canada_cana100&1521: fe23_northamerica_canada_cana213 (n_potential_duplicates=279) Progress: 1430/5147 --> Found potential duplicate: 1430: fe23_northamerica_canada_cana105&1525: fe23_northamerica_canada_cana217 (n_potential_duplicates=280) --> Found potential duplicate: 1439: fe23_northamerica_canada_cana116&1476: fe23_northamerica_canada_cana168w (n_potential_duplicates=281) Progress: 1440/5147 Progress: 1450/5147 Progress: 1460/5147 Progress: 1470/5147 --> Found potential duplicate: 1474: fe23_northamerica_canada_cana161&1475: fe23_northamerica_canada_cana162 (n_potential_duplicates=282) Progress: 1480/5147 Progress: 1490/5147 Progress: 1500/5147 Progress: 1510/5147 Progress: 1520/5147 Progress: 1530/5147 Progress: 1540/5147 Progress: 1550/5147 Progress: 1560/5147 Progress: 
1570/5147 Progress: 1580/5147 Progress: 1590/5147 Progress: 1600/5147 Progress: 1610/5147 Progress: 1620/5147 --> Found potential duplicate: 1622: fe23_southamerica_chil016&1623: fe23_southamerica_chil017 (n_potential_duplicates=283) Progress: 1630/5147 Progress: 1640/5147 Progress: 1650/5147 Progress: 1660/5147 Progress: 1670/5147 Progress: 1680/5147 Progress: 1690/5147 Progress: 1700/5147 Progress: 1710/5147 Progress: 1720/5147 Progress: 1730/5147 Progress: 1740/5147 Progress: 1750/5147 Progress: 1760/5147 Progress: 1770/5147 Progress: 1780/5147 Progress: 1790/5147 Progress: 1800/5147 Progress: 1810/5147 Progress: 1820/5147 Progress: 1830/5147 Progress: 1840/5147 Progress: 1850/5147 Progress: 1860/5147 Progress: 1870/5147 Progress: 1880/5147 Progress: 1890/5147 Progress: 1900/5147 Progress: 1910/5147 Progress: 1920/5147 Progress: 1930/5147 Progress: 1940/5147 Progress: 1950/5147 Progress: 1960/5147 Progress: 1970/5147 Progress: 1980/5147 Progress: 1990/5147 Progress: 2000/5147 Progress: 2010/5147 Progress: 2020/5147 Progress: 2030/5147 --> Found potential duplicate: 2035: fe23_europe_swed019w&2037: fe23_europe_swed021w (n_potential_duplicates=284) Progress: 2040/5147 Progress: 2050/5147 Progress: 2060/5147 Progress: 2070/5147 Progress: 2080/5147 Progress: 2090/5147 Progress: 2100/5147 Progress: 2110/5147 Progress: 2120/5147 Progress: 2130/5147 Progress: 2140/5147 Progress: 2150/5147 Progress: 2160/5147 Progress: 2170/5147 Progress: 2180/5147 Progress: 2190/5147 Progress: 2200/5147 Progress: 2210/5147 --> Found potential duplicate: 2215: fe23_northamerica_mexico_mexi022&2216: fe23_northamerica_mexico_mexi023 (n_potential_duplicates=285) Progress: 2220/5147 Progress: 2230/5147 Progress: 2240/5147 Progress: 2250/5147 Progress: 2260/5147 Progress: 2270/5147 Progress: 2280/5147 Progress: 2290/5147 --> Found potential duplicate: 2296: fe23_australia_newz003&2349: fe23_australia_newz060 (n_potential_duplicates=286) Progress: 2300/5147 --> Found potential duplicate: 
2300: fe23_australia_newz008&2381: fe23_australia_newz092 (n_potential_duplicates=287) --> Found potential duplicate: 2304: fe23_australia_newz014&2350: fe23_australia_newz061 (n_potential_duplicates=288) --> Found potential duplicate: 2308: fe23_australia_newz018&2351: fe23_australia_newz062 (n_potential_duplicates=289) --> Found potential duplicate: 2309: fe23_australia_newz019&2352: fe23_australia_newz063 (n_potential_duplicates=290) Progress: 2310/5147 Progress: 2320/5147 Progress: 2330/5147 Progress: 2340/5147 Progress: 2350/5147 Progress: 2360/5147 Progress: 2370/5147 Progress: 2380/5147 Progress: 2390/5147 Progress: 2400/5147 Progress: 2410/5147 Progress: 2420/5147 Progress: 2430/5147 Progress: 2440/5147 Progress: 2450/5147 Progress: 2460/5147 Progress: 2470/5147 Progress: 2480/5147 Progress: 2490/5147 Progress: 2500/5147 Progress: 2510/5147 Progress: 2520/5147 Progress: 2530/5147 Progress: 2540/5147 Progress: 2550/5147 Progress: 2560/5147 Progress: 2570/5147 Progress: 2580/5147 Progress: 2590/5147 Progress: 2600/5147 Progress: 2610/5147 Progress: 2620/5147 Progress: 2630/5147 Progress: 2640/5147 Progress: 2650/5147 Progress: 2660/5147 Progress: 2670/5147 Progress: 2680/5147 Progress: 2690/5147 Progress: 2700/5147 Progress: 2710/5147 Progress: 2720/5147 Progress: 2730/5147 Progress: 2740/5147 Progress: 2750/5147 Progress: 2760/5147 Progress: 2770/5147 Progress: 2780/5147 Progress: 2790/5147 Progress: 2800/5147 Progress: 2810/5147 Progress: 2820/5147 Progress: 2830/5147 Progress: 2840/5147 Progress: 2850/5147 Progress: 2860/5147 Progress: 2870/5147 --> Found potential duplicate: 2875: fe23_northamerica_usa_ca066&3003: fe23_northamerica_usa_ca628 (n_potential_duplicates=291) --> Found potential duplicate: 2876: fe23_northamerica_usa_ca067&3003: fe23_northamerica_usa_ca628 (n_potential_duplicates=292) Progress: 2880/5147 Progress: 2890/5147 --> Found potential duplicate: 2894: fe23_northamerica_usa_ca512&2988: fe23_northamerica_usa_ca613 
(n_potential_duplicates=293) Progress: 2900/5147 Progress: 2910/5147 --> Found potential duplicate: 2911: fe23_northamerica_usa_ca535&3043: fe23_northamerica_usa_ca670 (n_potential_duplicates=294) Progress: 2920/5147 Progress: 2930/5147 Progress: 2940/5147 Progress: 2950/5147 Progress: 2960/5147 Progress: 2970/5147 Progress: 2980/5147 Progress: 2990/5147 Progress: 3000/5147 Progress: 3010/5147 Progress: 3020/5147 Progress: 3030/5147 Progress: 3040/5147 Progress: 3050/5147 Progress: 3060/5147 Progress: 3070/5147 Progress: 3080/5147 Progress: 3090/5147 Progress: 3100/5147 Progress: 3110/5147 Progress: 3120/5147 Progress: 3130/5147 Progress: 3140/5147 Progress: 3150/5147 Progress: 3160/5147 Progress: 3170/5147 Progress: 3180/5147 Progress: 3190/5147 Progress: 3200/5147 Progress: 3210/5147 Progress: 3220/5147 Progress: 3230/5147 Progress: 3240/5147 Progress: 3250/5147 Progress: 3260/5147 Progress: 3270/5147 --> Found potential duplicate: 3271: fe23_northamerica_usa_me017&3272: fe23_northamerica_usa_me018 (n_potential_duplicates=295) Progress: 3280/5147 Progress: 3290/5147 Progress: 3300/5147 Progress: 3310/5147 Progress: 3320/5147 --> Found potential duplicate: 3326: fe23_northamerica_usa_mo&3335: fe23_northamerica_usa_mo009 (n_potential_duplicates=296) Progress: 3330/5147 Progress: 3340/5147 Progress: 3350/5147 Progress: 3360/5147 Progress: 3370/5147 --> Found potential duplicate: 3374: fe23_northamerica_usa_mt112&3375: fe23_northamerica_usa_mt113 (n_potential_duplicates=297) Progress: 3380/5147 Progress: 3390/5147 Progress: 3400/5147 Progress: 3410/5147 --> Found potential duplicate: 3415: fe23_northamerica_usa_nj001&3416: fe23_northamerica_usa_nj002 (n_potential_duplicates=298) Progress: 3420/5147 --> Found potential duplicate: 3429: fe23_northamerica_usa_nm024&3455: fe23_northamerica_usa_nm055 (n_potential_duplicates=299) Progress: 3430/5147 Progress: 3440/5147 Progress: 3450/5147 Progress: 3460/5147 Progress: 3470/5147 Progress: 3480/5147 Progress: 3490/5147 
Progress: 3500/5147 Progress: 3510/5147 --> Found potential duplicate: 3514: fe23_northamerica_usa_nv060&3532: fe23_northamerica_usa_nv518 (n_potential_duplicates=300) Progress: 3520/5147 --> Found potential duplicate: 3526: fe23_northamerica_usa_nv512&3535: fe23_northamerica_usa_nv521 (n_potential_duplicates=301) --> Found potential duplicate: 3527: fe23_northamerica_usa_nv513&3534: fe23_northamerica_usa_nv520 (n_potential_duplicates=302) Progress: 3530/5147 Progress: 3540/5147 Progress: 3550/5147 Progress: 3560/5147 Progress: 3570/5147 Progress: 3580/5147 Progress: 3590/5147 Progress: 3600/5147 Progress: 3610/5147 Progress: 3620/5147 Progress: 3630/5147 Progress: 3640/5147 Progress: 3650/5147 Progress: 3660/5147 Progress: 3670/5147 Progress: 3680/5147 Progress: 3690/5147 Progress: 3700/5147 Progress: 3710/5147 Progress: 3720/5147 Progress: 3730/5147 Progress: 3740/5147 Progress: 3750/5147 Progress: 3760/5147 Progress: 3770/5147 Progress: 3780/5147 Progress: 3790/5147 Progress: 3800/5147 Progress: 3810/5147 Progress: 3820/5147 Progress: 3830/5147 Progress: 3840/5147 Progress: 3850/5147 Progress: 3860/5147 Progress: 3870/5147 Progress: 3880/5147 Progress: 3890/5147 Progress: 3900/5147 Progress: 3910/5147 Progress: 3920/5147 Progress: 3930/5147 Progress: 3940/5147 --> Found potential duplicate: 3946: ch2k_zi15mer01_2&3947: ch2k_zi15mer01_4 (n_potential_duplicates=303) --> Found potential duplicate: 3948: ch2k_co03pal03_6&4286: iso2k_511 (n_potential_duplicates=304) --> Found potential duplicate: 3949: ch2k_co03pal02_8&4285: iso2k_509 (n_potential_duplicates=305) Progress: 3950/5147 --> Found potential duplicate: 3950: ch2k_li06rar01_12&4489: iso2k_1502 (n_potential_duplicates=306) --> Found potential duplicate: 3951: ch2k_co03pal07_14&4291: iso2k_521 (n_potential_duplicates=307) --> Found potential duplicate: 3953: ch2k_ur00mai01_22&4189: iso2k_94 (n_potential_duplicates=308) --> Found potential duplicate: 3953: ch2k_ur00mai01_22&4190: iso2k_98 
(n_potential_duplicates=309) --> Found potential duplicate: 3954: ch2k_tu95mad01_24&4169: iso2k_20 (n_potential_duplicates=310) --> Found potential duplicate: 3955: ch2k_zi04ifr01_26&4222: iso2k_257 (n_potential_duplicates=311) --> Found potential duplicate: 3956: ch2k_re18cay01_30&4382: iso2k_917 (n_potential_duplicates=312) Progress: 3960/5147 --> Found potential duplicate: 3960: ch2k_ku99hou01_40&4345: iso2k_786 (n_potential_duplicates=313) --> Found potential duplicate: 3960: ch2k_ku99hou01_40&4346: iso2k_788 (n_potential_duplicates=314) --> Found potential duplicate: 3965: ch2k_nu11pal01_52&4283: iso2k_505 (n_potential_duplicates=315) --> Found potential duplicate: 3965: ch2k_nu11pal01_52&4309: iso2k_579 (n_potential_duplicates=316) --> Found potential duplicate: 3968: ch2k_ca14tim01_64&4272: iso2k_473 (n_potential_duplicates=317) Progress: 3970/5147 --> Found potential duplicate: 3973: ch2k_he08lra01_76&4563: iso2k_1813 (n_potential_duplicates=318) --> Found potential duplicate: 3974: ch2k_da06maf01_78&4546: iso2k_1748 (n_potential_duplicates=319) --> Found potential duplicate: 3975: ch2k_na09mal01_84&4549: iso2k_1754 (n_potential_duplicates=320) --> Found potential duplicate: 3976: ch2k_sw98stp01_86&4176: iso2k_50 (n_potential_duplicates=321) Progress: 3980/5147 --> Found potential duplicate: 3980: ch2k_da06maf02_104&4546: iso2k_1748 (n_potential_duplicates=322) --> Found potential duplicate: 3983: ch2k_co03pal01_110&4284: iso2k_507 (n_potential_duplicates=323) --> Found potential duplicate: 3986: ch2k_ch98pir01_116&4438: iso2k_1229 (n_potential_duplicates=324) Progress: 3990/5147 --> Found potential duplicate: 3991: ch2k_xi17hai01_128&3994: ch2k_xi17hai01_136 (n_potential_duplicates=325) --> Found potential duplicate: 3991: ch2k_xi17hai01_128&4551: iso2k_1762 (n_potential_duplicates=326) --> Found potential duplicate: 3992: ch2k_xi17hai01_130&3993: ch2k_xi17hai01_134 (n_potential_duplicates=327) --> Found potential duplicate: 3994: ch2k_xi17hai01_136&4551: 
iso2k_1762 (n_potential_duplicates=328) --> Found potential duplicate: 3995: ch2k_de14dto03_140&3999: ch2k_de14dto01_148 (n_potential_duplicates=329) --> Found potential duplicate: 3997: ch2k_qu06rab01_144&4453: iso2k_1311 (n_potential_duplicates=330) Progress: 4000/5147 --> Found potential duplicate: 4000: ch2k_ku00nin01_150&4499: iso2k_1554 (n_potential_duplicates=331) --> Found potential duplicate: 4000: ch2k_ku00nin01_150&4500: iso2k_1556 (n_potential_duplicates=332) Progress: 4010/5147 --> Found potential duplicate: 4014: ch2k_ev18roc01_184&4015: ch2k_ev18roc01_186 (n_potential_duplicates=333) --> Found potential duplicate: 4016: ch2k_ca13sap01_188&4305: iso2k_569 (n_potential_duplicates=334) --> Found potential duplicate: 4018: ch2k_he13mis01_194&4210: iso2k_211 (n_potential_duplicates=335) --> Found potential duplicate: 4018: ch2k_he13mis01_194&4211: iso2k_213 (n_potential_duplicates=336) Progress: 4020/5147 --> Found potential duplicate: 4020: ch2k_zi15imp02_200&4021: ch2k_zi15imp02_202 (n_potential_duplicates=337) --> Found potential duplicate: 4022: ch2k_pf04pba01_204&4532: iso2k_1701 (n_potential_duplicates=338) --> Found potential duplicate: 4022: ch2k_pf04pba01_204&4533: iso2k_1704 (n_potential_duplicates=339) --> Found potential duplicate: 4026: ch2k_co03pal05_212&4288: iso2k_515 (n_potential_duplicates=340) Progress: 4030/5147 --> Found potential duplicate: 4031: ch2k_mo06ped01_226&4317: iso2k_629 (n_potential_duplicates=341) --> Found potential duplicate: 4035: ch2k_os14ucp01_236&4249: iso2k_350 (n_potential_duplicates=342) --> Found potential duplicate: 4039: ch2k_he10gua01_244&4542: iso2k_1735 (n_potential_duplicates=343) Progress: 4040/5147 --> Found potential duplicate: 4046: ch2k_dr99abr01_264&4047: ch2k_dr99abr01_266 (n_potential_duplicates=344) --> Found potential duplicate: 4046: ch2k_dr99abr01_264&4188: iso2k_91 (n_potential_duplicates=345) --> Found potential duplicate: 4047: ch2k_dr99abr01_266&4188: iso2k_91 (n_potential_duplicates=346) 
--> Found potential duplicate: 4048: ch2k_li06rar02_270&4488: iso2k_1500 (n_potential_duplicates=347) Progress: 4050/5147 --> Found potential duplicate: 4052: ch2k_zi15tan01_278&4053: ch2k_zi15tan01_280 (n_potential_duplicates=348) Progress: 4060/5147 --> Found potential duplicate: 4062: ch2k_as05gua01_302&4502: iso2k_1559 (n_potential_duplicates=349) --> Found potential duplicate: 4063: ch2k_fe09oga01_304&4596: iso2k_1922 (n_potential_duplicates=350) --> Found potential duplicate: 4067: ch2k_gu99nau01_314&4328: iso2k_702 (n_potential_duplicates=351) --> Found potential duplicate: 4067: ch2k_gu99nau01_314&4329: iso2k_705 (n_potential_duplicates=352) Progress: 4070/5147 --> Found potential duplicate: 4070: ch2k_co03pal10_324&4290: iso2k_519 (n_potential_duplicates=353) --> Found potential duplicate: 4072: ch2k_zi15imp01_328&4073: ch2k_zi15imp01_330 (n_potential_duplicates=354) --> Found potential duplicate: 4076: ch2k_ro19yuc01_338&4077: ch2k_ro19yuc01_340 (n_potential_duplicates=355) Progress: 4080/5147 --> Found potential duplicate: 4084: ch2k_co03pal09_358&4293: iso2k_525 (n_potential_duplicates=356) --> Found potential duplicate: 4087: ch2k_ki04mcv01_366&4203: iso2k_155 (n_potential_duplicates=357) Progress: 4090/5147 --> Found potential duplicate: 4091: ch2k_ba04fij02_382&4177: iso2k_52 (n_potential_duplicates=358) --> Found potential duplicate: 4092: ch2k_co03pal06_386&4289: iso2k_517 (n_potential_duplicates=359) --> Found potential duplicate: 4096: ch2k_go12sbv01_396&4365: iso2k_870 (n_potential_duplicates=360) --> Found potential duplicate: 4098: ch2k_ca07fli01_400&4406: iso2k_1057 (n_potential_duplicates=361) Progress: 4100/5147 --> Found potential duplicate: 4101: ch2k_co93tar01_408&4296: iso2k_539 (n_potential_duplicates=362) --> Found potential duplicate: 4103: ch2k_co00mal01_412&4397: iso2k_1010 (n_potential_duplicates=363) --> Found potential duplicate: 4107: ch2k_qu96esv01_422&4213: iso2k_218 (n_potential_duplicates=364) --> Found potential duplicate: 
4108: ch2k_de13hai01_424&4111: ch2k_de13hai01_432 (n_potential_duplicates=365) --> Found potential duplicate: 4108: ch2k_de13hai01_424&4523: iso2k_1643 (n_potential_duplicates=366) --> Found potential duplicate: 4109: ch2k_de13hai01_426&4110: ch2k_de13hai01_430 (n_potential_duplicates=367) Progress: 4110/5147 --> Found potential duplicate: 4111: ch2k_de13hai01_432&4523: iso2k_1643 (n_potential_duplicates=368) --> Found potential duplicate: 4112: ch2k_li94sec01_436&4419: iso2k_1124 (n_potential_duplicates=369) --> Found potential duplicate: 4113: ch2k_zi15cle01_438&4114: ch2k_zi15cle01_440 (n_potential_duplicates=370) --> Found potential duplicate: 4117: ch2k_tu01dep01_450&4429: iso2k_1201 (n_potential_duplicates=371) --> Found potential duplicate: 4118: ch2k_co03pal04_452&4287: iso2k_513 (n_potential_duplicates=372) Progress: 4120/5147 --> Found potential duplicate: 4121: ch2k_fl18dto01_460&4151: ch2k_fl18dto02_554 (n_potential_duplicates=373) --> Found potential duplicate: 4124: ch2k_du94urv01_468&4125: ch2k_du94urv01_470 (n_potential_duplicates=374) --> Found potential duplicate: 4126: ch2k_co03pal08_472&4292: iso2k_523 (n_potential_duplicates=375) --> Found potential duplicate: 4128: ch2k_zi14tur01_480&4129: ch2k_zi14tur01_482 (n_potential_duplicates=376) --> Found potential duplicate: 4128: ch2k_zi14tur01_480&4239: iso2k_302 (n_potential_duplicates=377) --> Found potential duplicate: 4129: ch2k_zi14tur01_482&4239: iso2k_302 (n_potential_duplicates=378) Progress: 4130/5147 --> Found potential duplicate: 4130: ch2k_li99cli01_486&4506: iso2k_1571 (n_potential_duplicates=379) --> Found potential duplicate: 4131: ch2k_zi15bun01_488&4132: ch2k_zi15bun01_490 (n_potential_duplicates=380) --> Found potential duplicate: 4133: ch2k_fe18rus01_492&4580: iso2k_1861 (n_potential_duplicates=381) --> Found potential duplicate: 4137: ch2k_wu13ton01_504&4138: ch2k_wu13ton01_506 (n_potential_duplicates=382) --> Found potential duplicate: 4139: ch2k_ki14par01_510&4142: 
ch2k_ki14par01_518 (n_potential_duplicates=383) Progress: 4140/5147 --> Found potential duplicate: 4140: ch2k_ki14par01_512&4141: ch2k_ki14par01_516 (n_potential_duplicates=384) --> Found potential duplicate: 4143: ch2k_zi14ifr02_522&4144: ch2k_zi14ifr02_524 (n_potential_duplicates=385) Progress: 4150/5147 --> Found potential duplicate: 4152: ch2k_ba04fij01_558&4178: iso2k_55 (n_potential_duplicates=386) --> Found potential duplicate: 4155: ch2k_li06fij01_582&4250: iso2k_353 (n_potential_duplicates=387) Progress: 4160/5147 Progress: 4170/5147 --> Found potential duplicate: 4179: iso2k_58&4408: iso2k_1068 (n_potential_duplicates=388) Progress: 4180/5147 --> Found potential duplicate: 4189: iso2k_94&4190: iso2k_98 (n_potential_duplicates=389) Progress: 4190/5147 --> Found potential duplicate: 4197: iso2k_120&4772: sisal_253.0_171 (n_potential_duplicates=390) Progress: 4200/5147 --> Found potential duplicate: 4202: iso2k_140&4785: sisal_278.0_184 (n_potential_duplicates=391) Progress: 4210/5147 --> Found potential duplicate: 4215: iso2k_236&4742: sisal_205.0_141 (n_potential_duplicates=392) Progress: 4220/5147 Progress: 4230/5147 --> Found potential duplicate: 4235: iso2k_296&4236: iso2k_298 (n_potential_duplicates=393) --> Found potential duplicate: 4235: iso2k_296&4237: iso2k_299 (n_potential_duplicates=394) --> Found potential duplicate: 4236: iso2k_298&4237: iso2k_299 (n_potential_duplicates=395) Progress: 4240/5147 Progress: 4250/5147 --> Found potential duplicate: 4255: iso2k_380&4893: sisal_446.0_292 (n_potential_duplicates=396) --> Found potential duplicate: 4257: iso2k_399&4348: iso2k_806 (n_potential_duplicates=397) --> Found potential duplicate: 4257: iso2k_399&4349: iso2k_811 (n_potential_duplicates=398) Progress: 4260/5147
/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
Progress: 4270/5147 Progress: 4280/5147 --> Found potential duplicate: 4283: iso2k_505&4309: iso2k_579 (n_potential_duplicates=399) Progress: 4290/5147 --> Found potential duplicate: 4295: iso2k_533&4670: sisal_115.0_69 (n_potential_duplicates=400) --> Found potential duplicate: 4298: iso2k_546&4300: iso2k_549 (n_potential_duplicates=401) --> Found potential duplicate: 4299: iso2k_547&4301: iso2k_550 (n_potential_duplicates=402) Progress: 4300/5147 Progress: 4310/5147 Progress: 4320/5147 --> Found potential duplicate: 4328: iso2k_702&4329: iso2k_705 (n_potential_duplicates=403) Progress: 4330/5147 Progress: 4340/5147 --> Found potential duplicate: 4341: iso2k_772&4342: iso2k_775 (n_potential_duplicates=404) --> Found potential duplicate: 4345: iso2k_786&4346: iso2k_788 (n_potential_duplicates=405) --> Found potential duplicate: 4348: iso2k_806&4349: iso2k_811 (n_potential_duplicates=406) Progress: 4350/5147 Progress: 4360/5147 --> Found potential duplicate: 4366: iso2k_873&4915: sisal_471.0_314 (n_potential_duplicates=407) Progress: 4370/5147 Progress: 4380/5147 Progress: 4390/5147 Progress: 4400/5147 --> Found potential duplicate: 4409: iso2k_1069&4528: iso2k_1660 (n_potential_duplicates=408) Progress: 4410/5147 --> Found potential duplicate: 4415: iso2k_1107&4564: iso2k_1817 (n_potential_duplicates=409) --> Found potential duplicate: 4415: iso2k_1107&4775: sisal_271.0_174 (n_potential_duplicates=410) Progress: 4420/5147 --> Found potential duplicate: 4426: iso2k_1178&4734: sisal_201.0_133 (n_potential_duplicates=411) Progress: 4430/5147 Progress: 4440/5147 --> Found potential duplicate: 4444: iso2k_1283&4445: iso2k_1286 (n_potential_duplicates=412) --> Found potential duplicate: 4447: iso2k_1288&4814: sisal_329.0_213 (n_potential_duplicates=413) --> Found potential duplicate: 4448: iso2k_1291&4816: sisal_330.0_215 (n_potential_duplicates=414) Progress: 4450/5147 Progress: 4460/5147 Progress: 4470/5147 Progress: 4480/5147 --> Found potential duplicate: 4486: 
iso2k_1495&4800: sisal_305.0_199 (n_potential_duplicates=415) Progress: 4490/5147 --> Found potential duplicate: 4490: iso2k_1504&4667: sisal_113.0_66 (n_potential_duplicates=416) --> Found potential duplicate: 4499: iso2k_1554&4500: iso2k_1556 (n_potential_duplicates=417) Progress: 4500/5147 Progress: 4510/5147 Progress: 4520/5147 Progress: 4530/5147 --> Found potential duplicate: 4532: iso2k_1701&4533: iso2k_1704 (n_potential_duplicates=418) Progress: 4540/5147 Progress: 4550/5147 Progress: 4560/5147 --> Found potential duplicate: 4564: iso2k_1817&4775: sisal_271.0_174 (n_potential_duplicates=419) --> Found potential duplicate: 4565: iso2k_1820&4778: sisal_272.0_177 (n_potential_duplicates=420) --> Found potential duplicate: 4566: iso2k_1823&4780: sisal_273.0_179 (n_potential_duplicates=421) Progress: 4570/5147 --> Found potential duplicate: 4572: iso2k_1848&4578: iso2k_1855 (n_potential_duplicates=422) --> Found potential duplicate: 4573: iso2k_1850&4574: iso2k_1851 (n_potential_duplicates=423) --> Found potential duplicate: 4579: iso2k_1856&4795: sisal_294.0_194 (n_potential_duplicates=424) Progress: 4580/5147 Progress: 4590/5147 Progress: 4600/5147 Progress: 4610/5147 --> Found potential duplicate: 4619: sisal_46.0_18&4622: sisal_47.0_21 (n_potential_duplicates=425) Progress: 4620/5147 --> Found potential duplicate: 4620: sisal_46.0_19&4623: sisal_47.0_22 (n_potential_duplicates=426) --> Found potential duplicate: 4621: sisal_46.0_20&4624: sisal_47.0_23 (n_potential_duplicates=427) Progress: 4630/5147 Progress: 4640/5147 Progress: 4650/5147 Progress: 4660/5147 Progress: 4670/5147 Progress: 4680/5147 Progress: 4690/5147 Progress: 4700/5147 Progress: 4710/5147 Progress: 4720/5147 Progress: 4730/5147 Progress: 4740/5147 Progress: 4750/5147 Progress: 4760/5147 Progress: 4770/5147 Progress: 4780/5147 Progress: 4790/5147 Progress: 4800/5147 Progress: 4810/5147 Progress: 4820/5147 Progress: 4830/5147 Progress: 4840/5147 Progress: 4850/5147 Progress: 4860/5147 
Progress: 4870/5147 --> Found potential duplicate: 4871: sisal_430.0_270&5132: sisal_896.0_531 (n_potential_duplicates=428) --> Found potential duplicate: 4872: sisal_430.0_271&5134: sisal_896.0_533 (n_potential_duplicates=429) Progress: 4880/5147 Progress: 4890/5147 Progress: 4900/5147 Progress: 4910/5147 Progress: 4920/5147 Progress: 4930/5147 Progress: 4940/5147 Progress: 4950/5147 Progress: 4960/5147 Progress: 4970/5147 Progress: 4980/5147 Progress: 4990/5147 Progress: 5000/5147 Progress: 5010/5147 Progress: 5020/5147 Progress: 5030/5147 Progress: 5040/5147 Progress: 5050/5147 Progress: 5060/5147 Progress: 5070/5147 Progress: 5080/5147 Progress: 5090/5147 Progress: 5100/5147 Progress: 5110/5147 Progress: 5120/5147 Progress: 5130/5147 Progress: 5140/5147
============================================================
Saved indices, IDs, distances, correlations in data/all_merged/dup_detection/
============================================================
Detected 429 possible duplicates in all_merged.
============================================================
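A note on the RuntimeWarning in the log above: the quoted line `c /= stddev[:, None]` is the normalisation step inside NumPy's `np.corrcoef`. The warning typically appears when one of the two compared series has zero variance, so the correlation is 0/0 and evaluates to NaN. How `find_duplicates` treats such pairs is not shown in this notebook. The following is only a minimal sketch of one way to guard against the case, and `safe_corr` is a hypothetical helper, not part of `f_duplicate_search`.

```python
import warnings
import numpy as np

def safe_corr(a, b):
    """Pearson correlation of two equal-length series, or NaN if undefined.

    Hypothetical helper for illustration; not part of f_duplicate_search.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Keep only the points where both series are finite.
    ok = np.isfinite(a) & np.isfinite(b)
    a, b = a[ok], b[ok]
    # With fewer than two points, or a constant series, the correlation is undefined.
    if a.size < 2 or a.max() == a.min() or b.max() == b.min():
        return np.nan
    return float(np.corrcoef(a, b)[0, 1])

constant = [1.0, 1.0, 1.0, 1.0]
varying = [0.1, 0.4, 0.2, 0.9]

# Unguarded call: the zero-variance series makes np.corrcoef divide by zero.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    raw = np.corrcoef(constant, varying)[0, 1]

print("unguarded result is NaN:", np.isnan(raw))
print("warnings raised:", len(caught))
print("guarded result:", safe_corr(constant, varying))
```

Returning NaN explicitly keeps the pair identifiable downstream (for example via `np.isnan`) without relying on the warning being noticed in a long log.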
# dup.plot_duplicates(df, save_figures=False)
fn = utf.find(f'dup_detection_candidates_{df.name}.csv', f'data/{df.name}/dup_detection')
# fn holds the matching file paths; an empty list means no output file was found
if fn != []:
    print('----------------------------------------------------')
    print('Successfully finished the duplicate detection process!'.upper())
    print('----------------------------------------------------')
    print('Saved the detection output file in:')
    print()
    print('%s.' % ', '.join(fn))
    print()
    print('You are now able to proceed to the next notebook: dup_decision.ipynb')
else:
    print('Final output file is missing.')
    print()
    print('Please re-run the notebook to complete the duplicate detection process.')
----------------------------------------------------
SUCCESSFULLY FINISHED THE DUPLICATE DETECTION PROCESS!
----------------------------------------------------
Saved the detection output file in:

data/all_merged/dup_detection/dup_detection_candidates_all_merged.csv.

You are now able to proceed to the next notebook: dup_decision.ipynb
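The detection log printed earlier reports each candidate pair as `index1: ID1&index2: ID2`. To get a quick overview of which databases the candidates come from, the log text can be tallied by ID prefix. This is a rough sketch that parses the printed log only; it assumes the database name is the part of the ID before the first underscore, and the helper names `database_of` and `tally_pairs` are made up for illustration.

```python
import re
from collections import Counter

# One log entry looks like: "Found potential duplicate: 836: pages2k_2607&839: pages2k_2612"
PAIR = re.compile(r"Found potential duplicate: (\d+): (\S+?)&(\d+): (\S+)")

def database_of(record_id):
    """Database prefix of a record ID, e.g. 'pages2k_2607' -> 'pages2k' (assumed convention)."""
    return record_id.split("_", 1)[0]

def tally_pairs(log_text):
    """Count candidate pairs per (database, database) combination, order-independent."""
    counts = Counter()
    for _, id1, _, id2 in PAIR.findall(log_text):
        key = tuple(sorted((database_of(id1), database_of(id2))))
        counts[key] += 1
    return counts

# Three entries copied from the log above.
sample = (
    "--> Found potential duplicate: 836: pages2k_2607&839: pages2k_2612 (n_potential_duplicates=202) "
    "Progress: 840/5147 "
    "--> Found potential duplicate: 840: pages2k_2613&4480: iso2k_1470 (n_potential_duplicates=204) "
    "--> Found potential duplicate: 3948: ch2k_co03pal03_6&4286: iso2k_511 (n_potential_duplicates=304)"
)

counts = tally_pairs(sample)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

For anything beyond a quick look, the saved CSV in `data/all_merged/dup_detection/` is the more reliable source, since it is what the next notebook reads.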