Run the duplicate detection workflow to generate a duplicate free dataframe

This workflow runs a duplicate detection, decision and removal algorithm to generate a duplicate free dataframe.

Required columns

The input dataframe must have the following columns:

archiveType (used for duplicate detection algorithm)
dataSetName
datasetId
geo_meanElev (used for duplicate detection algorithm)
geo_meanLat (used for duplicate detection algorithm)
geo_meanLon (used for duplicate detection algorithm)
geo_siteName (used for duplicate detection algorithm)
interpretation_direction
interpretation_seasonality
interpretation_variable
interpretation_variableDetails
originalDataURL
originalDatabase
paleoData_notes
paleoData_proxy (used for duplicate detection algorithm)
paleoData_units
paleoData_values (used for duplicate detection algorithm, test for correlation, RMSE, correlation of 1st difference, RMSE of 1st difference)
paleoData_variableName
year (used for duplicate detection algorithm)
yearUnits
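
Before running the workflow, it can help to confirm that none of these columns is missing. The following is a minimal sketch, not part of dod2k_utilities; the helper name missing_required_columns is hypothetical.

```python
import pandas as pd

# Column names copied from the list of required columns above.
REQUIRED_COLUMNS = [
    'archiveType', 'dataSetName', 'datasetId', 'geo_meanElev', 'geo_meanLat',
    'geo_meanLon', 'geo_siteName', 'interpretation_direction',
    'interpretation_seasonality', 'interpretation_variable',
    'interpretation_variableDetails', 'originalDataURL', 'originalDatabase',
    'paleoData_notes', 'paleoData_proxy', 'paleoData_units',
    'paleoData_values', 'paleoData_variableName', 'year', 'yearUnits',
]

def missing_required_columns(df):
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]

# Toy dataframe that deliberately lacks two of the required columns.
toy = pd.DataFrame(columns=[c for c in REQUIRED_COLUMNS if c not in ('year', 'yearUnits')])
missing = missing_required_columns(toy)
print('Missing columns:', missing)
```

An empty list means the dataframe has every column the workflow expects.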

Output Location

All outputs are saved as csv in the directory data/DATABASENAME/dup_detection.

Step 1: Duplicate detection (`dup_detection.ipynb`)

Notebook: dup_detection.ipynb

This interactive notebook (dup_detection.ipynb) runs a duplicate detection algorithm for a specific database.

1.1 Set up working environment

Make sure the repo_root is set correctly to your_root_dir/dod2k. This should be the working directory throughout this notebook (and all other notebooks). The following libraries are required to run this notebook:

python3/Jupyter

import pandas as pd
import numpy as np

from dod2k_utilities import ut_functions as utf # contains utility functions
from dod2k_utilities import ut_duplicate_search as dup # contains utility functions

1.2 Load the compact dataframe

Define the dataset which needs to be screened for duplicates. Input files for the duplicate detection mechanism need to be compact dataframes (pandas dataframes with standardised columns and entry formatting).

The function load_compact_dataframe_from_csv loads the dataframe from a csv file in data/DB/, where DB is the name of the database. The database name (db_name) can be one of

pages2k
ch2k
iso2k
sisal
fe23

for the individual databases, or

all_merged

to load the merged database of all individual databases. It can also be the name of any user-defined compact dataframe.

Load the dataframe using

python3/Jupyter

db_name='all_merged' 
df = utf.load_compact_dataframe_from_csv(db_name)

1.3 Run the duplicate detection algorithm

Now run the first part of the duplicate detection algorithm, which goes through each candidate pair and evaluates the pairs according to a defined set of criteria.

python3/Jupyter

dup.find_duplicates_optimized(df, n_points_thresh=10)

Output: data/DB/dup_detection/dup_detection_candidates_DB.csv

Detection Criteria

metadata criteria:
archive types (archiveType) must be identical
proxy types (paleoData_proxy) must be identical
geographical criteria:
elevation (geo_meanElev) similar, within defined tolerance (use kwarg elevation_tolerance, defaults to 0)
latitude and longitude (geo_meanLat and geo_meanLon) similar, within defined tolerance in km (use kwarg dist_tolerance_km, defaults to 8 km)
overlap criterion:
time must overlap for at least $n$ points (use kwarg n_points_thresh to modify, defaults to $n=10$) unless at least one of the records is shorter than n_points_thresh
site criterion:
there must be some overlap in the site name (geo_siteName)
correlation criteria:
correlation between the overlapping period must be greater than defined threshold (use corr_thresh to modify, defaults to 0.9) or correlation of first difference must be greater than defined threshold (use corr_diff_thresh to modify, defaults to 0.9)
RMSE of overlapping period must be smaller than defined threshold (use rmse_thresh to modify, defaults to 0.1) or RMSE of first difference must be smaller than defined threshold (use rmse_diff_thresh to modify, defaults to 0.1)
URL criterion:
URLs (originalDataURL) must be identical if both records originate from the same database (originalDatabase must be identical)
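
The geographical and correlation criteria above can be illustrated with a short sketch. This is not the implementation in ut_duplicate_search: the function names haversine_km and overlap_metrics are hypothetical, the great-circle formula is an assumption about how the distance in km is obtained, and the sketch does not standardise the values before computing the RMSE.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def overlap_metrics(year1, vals1, year2, vals2):
    """Correlation and RMSE on the shared time steps and on their first differences."""
    # Indices of the time steps present in both records.
    common, i1, i2 = np.intersect1d(year1, year2, return_indices=True)
    x = np.asarray(vals1, dtype=float)[i1]
    y = np.asarray(vals2, dtype=float)[i2]
    dx, dy = np.diff(x), np.diff(y)
    return {
        'n_overlap': len(common),
        'corr': np.corrcoef(x, y)[0, 1],
        'rmse': np.sqrt(np.mean((x - y) ** 2)),
        'corr_diff': np.corrcoef(dx, dy)[0, 1],
        'rmse_diff': np.sqrt(np.mean((dx - dy) ** 2)),
    }

# Toy records: the second is the first shifted in time coverage and offset by 0.01.
year1 = np.arange(1900, 1950)
year2 = np.arange(1920, 1970)
vals1 = np.sin(year1 / 5.0)
vals2 = np.sin(year2 / 5.0) + 0.01

m = overlap_metrics(year1, vals1, year2, vals2)
dist = haversine_km(0.0, 0.0, 0.0, 1.0)  # one degree of longitude at the equator
print(m)
print(f'distance: {dist:.1f} km, within 8 km tolerance: {dist <= 8}')
```

In this toy case the two records share 30 time steps and differ only by a constant offset, so both correlations are close to 1 and the RMSE of the first differences is close to 0.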

Flagging Logic

A potential duplicate candidate pair is flagged if all of these criteria are satisfied, or if the correlation between the candidates is particularly high (>0.98) and there is sufficient overlap (as defined by the overlap criterion).
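
As a hedged illustration of this rule, the sketch below combines per-criterion booleans into a single flag. The function is_flagged and the dictionary keys are hypothetical and only mirror the description above; the actual logic lives in dup.find_duplicates_optimized.

```python
def is_flagged(criteria, corr, high_corr=0.98):
    """Flag a candidate pair from per-criterion booleans (illustrative only).

    criteria: dict of booleans, one per detection criterion; it must contain
              the key 'overlap' for the overlap criterion.
    corr:     correlation of the two records over their overlapping period.
    """
    all_met = all(criteria.values())
    very_high_corr = corr > high_corr and criteria['overlap']
    return all_met or very_high_corr

# Toy example: the site names do not overlap, so not all criteria are met.
criteria = {'metadata': True, 'geo': True, 'overlap': True,
            'site': False, 'correlation': True, 'url': True}

flag_high = is_flagged(criteria, corr=0.99)  # very high correlation with sufficient overlap
flag_low = is_flagged(criteria, corr=0.95)   # neither route to a flag applies
print(flag_high, flag_low)
```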

Tip for large databases

The duplicate detection algorithm can take a while to run, especially for large databases (such as the merged database with over 5000 records). Instead of running this notebook interactively, it might therefore be better to execute it as a python script via the command line.

In order to do this, run

bash

cd ~/dod2k_v2.0/dod2k
mkdir -p scripts
jupyter nbconvert --to python notebooks/dup_detection.ipynb --stdout | \
  sed 's/^get_ipython()/# get_ipython()/' | \
  sed 's/^\([[:space:]]*\)%/\1# %/' > scripts/dup_detection.py

This generates the script scripts/dup_detection.py. Make sure you have modified this file to load the correct database before executing it. Then run

bash

python scripts/dup_detection.py

Optional: Plot flagged candidate pairs. Figures are saved to figs/DB/dup_detection/.

python3/Jupyter

dup.plot_duplicates(df, save_figures=True)

Note

These same figures are used in the duplicate decision process.

Step 2: Duplicate decisions (`dup_decision.ipynb`)

Notebook: dup_decision.ipynb

This interactive notebook (dup_decision.ipynb) runs a duplicate decision algorithm for a specific database, following the identification of the potential duplicate candidate pairs. The algorithm walks the operator through each of the detected duplicate candidate pairs from dup_detection.ipynb and runs a decision process to decide whether to keep or reject the identified records.

2.1 Initialisation

To set up the working directory and load the compact dataframe, please follow the instructions detailed in steps 1.1 (set up working directory) and 1.2 (load compact dataframe).

In addition, the operator is asked to provide their credentials along with the decision process. Please fill in your details:

python3/Jupyter

initials = 'FN'
fullname = 'Full Name'
email    = 'name@email.ac.uk'
operator_details = [initials, fullname, email]

Why Credentials?

Initials label intermediate output files
Name and email ensure transparency and traceability

2.2 Hierarchy for duplicate removal for identical duplicates

For automated decisions, which apply to identical duplicates, we have defined a hierarchy (importance level) for the databases, which automatically determines which record is kept when data and metadata are identical.

The hierarchy is assigned to the original databases, from 1, the highest level (always kept), down to $n$, the lowest level (with $n$ the number of original databases). It is stored in the dataframe as an additional column (Hierarchy) for the decision process.

To add the hierarchy to the dataframe, run

python3/Jupyter

# implement hierarchy for automated decisions for identical records

df = dup.define_hierarchy(df, hierarchy='default')

By default the hierarchy uses the novelty of the databases for determining the importance level:

PAGES 2k v2.2.0 > SISAL v3 > CoralHydro2k v1.0.1 > Iso2k v1.1.2 > FE23 (Breitenmoser et al. (2014))
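
To make the idea concrete, here is a minimal sketch of a hierarchy column built with a plain pandas mapping. It is not the code of dup.define_hierarchy; it assumes that originalDatabase holds exactly the database labels listed above, and the record IDs rec_a and rec_b are made up.

```python
import pandas as pd

# Default order as listed above: 1 is the highest level.
default_rank = {
    'PAGES 2k v2.2.0': 1,
    'SISAL v3': 2,
    'CoralHydro2k v1.0.1': 3,
    'Iso2k v1.1.2': 4,
    'FE23 (Breitenmoser et al. (2014))': 5,
}

# Two hypothetical identical records from different databases.
toy = pd.DataFrame({
    'datasetId': ['rec_a', 'rec_b'],
    'originalDatabase': ['FE23 (Breitenmoser et al. (2014))', 'PAGES 2k v2.2.0'],
})
toy['Hierarchy'] = toy['originalDatabase'].map(default_rank)

# For identical duplicates, the record with the smallest Hierarchy value is kept.
keep = toy.loc[toy['Hierarchy'].idxmin(), 'datasetId']
print(toy)
print('Record kept:', keep)
```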

Info

The hierarchy can be changed by providing a dictionary to the hierarchy kwarg:

python3/Jupyter

df = dup.define_hierarchy(df)  # Use default hierarchy
custom = {'PAGES 2k v2.2.0': 2, 'SISAL v3': 1, 'FE23 (Breitenmoser et al. (2014))': 3, 'CoralHydro2k v1.0.1': 4, 'Iso2k v1.1.2': 5}
df = dup.define_hierarchy(df, hierarchy=custom)  # Custom hierarchy

Note

The hierarchy is not saved in the final duplicate-free database.

In order to reduce the operator workload, you also have the option to implement an automatic choice for specific database combinations. Please also specify a reason when doing so!

This is meant to be for any records which do not satisfy the hierarchy criterion, i.e. records with different data but identical metadata, such as updated records.

If you do not wish to do this, delete automate_db_choice from kwargs or set to False (default).

For example we have set

python3/Jupyter

automate_db_choice = {'preferred_db': 'FE23 (Breitenmoser et al. (2014))', 
                      'rejected_db': 'PAGES 2k v2.2.0', 
                      'reason': 'conservative replication requirement'}
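
The sketch below shows one way such a rule could translate into a pair of decisions. It is an assumption about the behaviour, not the logic used inside dup.duplicate_decisions_multiple, and the function automated_choice is hypothetical.

```python
def automated_choice(db1, db2, rule):
    """Return (decision 1, decision 2, reason), or None if the rule does not apply.

    db1, db2: originalDatabase of the first and second record of a pair.
    rule:     dict with the keys 'preferred_db', 'rejected_db' and 'reason',
              or False/None to disable the automatic choice.
    """
    if not rule:
        return None
    if db1 == rule['preferred_db'] and db2 == rule['rejected_db']:
        return 'KEEP', 'REMOVE', rule['reason']
    if db2 == rule['preferred_db'] and db1 == rule['rejected_db']:
        return 'REMOVE', 'KEEP', rule['reason']
    return None  # any other database combination is left to the operator

rule = {'preferred_db': 'FE23 (Breitenmoser et al. (2014))',
        'rejected_db': 'PAGES 2k v2.2.0',
        'reason': 'conservative replication requirement'}

choice = automated_choice('PAGES 2k v2.2.0', 'FE23 (Breitenmoser et al. (2014))', rule)
print(choice)
```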

2.3 Duplicate decision process

Run the decision algorithm:

python3/Jupyter

dup.duplicate_decisions_multiple(df, operator_details=operator_details, choose_recollection=True, 
                                 remove_identicals=True, backup=True, comment=True, automate_db_choice=automate_db_choice)

Decision options for each pair:

keep both records
keep just one record
delete both records
create composite of both records.

Automated Decisions

Recollections/updates: automatically selected
Identical duplicates: highest hierarchy record kept automatically
Automate db choice: as described previously

Example prompts:

Figure 1: Summary figure of a potential duplicate candidate pair, for which the operator is asked to make a decision.

**Decision required for this duplicate pair (see figure above).**
Before inputting your decision.
Would you like to leave a comment on your decision process?
**COMMENT** Please type your comment here and/or press enter.

 **DECISION** Keep record 1 (pages2k_50, blue circles) [1],
record 2 (FE23_northamerica_canada_cana091, red crosses) [2],
keep both [b], keep none [n] or create a composite of both records [c]?
Note: only overlapping timesteps are being composited. [Type 1/2/b/n/c]:

Output: data/DB/dup_detection/dup_decisions_DB_INITIALS_DATE.csv

Figures: figs/dup_detection/DB/ (linked in output CSV)

Backup & Resume

The process creates backup files in data/DB/dup_detection/. If interrupted, you can resume from the backup.

Handling of multiple duplicates

The decision process goes through the duplicates on a pair-by-pair basis and is currently not optimised for handling multiple duplicates (i.e. records which have more than one potential duplicate candidate). However, dup.duplicate_decisions_multiple includes improved handling of multiple duplicates. For any record which is associated with multiple duplicates, all the other duplicate candidates are shown alongside the summary figure for the duplicate candidate pair. Any previous decisions, when available, are shown beside the datasetId, archiveType, paleoData_proxy etc.:

***ATTENTION*** THIS RECORD IS ASSOCIATED WITH MULTIPLE DUPLICATES! 
PLEASE PAY SPECIAL ATTENTION WHEN MAKING DECISIONS FOR THIS RECORD!
The potential duplicates also associated with this record are:
 Dataset ID          : iso2k_786
     - URL                 : https://www.ncdc.noaa.gov/paleo/study/1856

Figure 2: Summary figure for multiple duplicates.

The operator can then make an informed decision for each candidate pair.

Can I Reverse a Decision?

There is currently no option to reverse a decision while running the duplicate decisions. However, should the operator want to revise a previous decision, they have the following options:

Most recent decision: Interrupt the process, remove the last line from the backup file (data/DB/dup_detection/dup_decisions_DB_INITIALS_BACKUP.csv), then restart.
Any decision: Interrupt and directly edit the backup file columns 'Decision 1' and 'Decision 2'. Use only: KEEP, REMOVE, or COMPOSITE.
After completion: Manually edit the final output file with correct terminology.
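
After editing decisions by hand, it is worth checking that only the permitted terms were used. The following is a minimal sketch with a hypothetical helper, invalid_decision_rows; it relies only on the column names 'Decision 1' and 'Decision 2' mentioned above and makes no assumption about the rest of the file layout.

```python
import pandas as pd

ALLOWED = {'KEEP', 'REMOVE', 'COMPOSITE'}

def invalid_decision_rows(df_decisions):
    """Return the rows whose 'Decision 1' or 'Decision 2' is not an allowed term."""
    ok = (df_decisions['Decision 1'].isin(ALLOWED)
          & df_decisions['Decision 2'].isin(ALLOWED))
    return df_decisions[~ok]

# Toy decisions table; the second row contains a lower-case typo.
toy = pd.DataFrame({
    'datasetId 1': ['a', 'a'],
    'datasetId 2': ['b', 'c'],
    'Decision 1': ['REMOVE', 'keep'],
    'Decision 2': ['KEEP', 'REMOVE'],
})
bad = invalid_decision_rows(toy)
print(bad)
```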

Step 3: Duplicate removal (`dup_removal.ipynb`)

Notebook: dup_removal.ipynb

This notebook removes duplicates based on the operator's previous decisions (see Step 2).

3.1 Initialisation

To set up the working directory and load the compact dataframe, please follow the instructions detailed in steps 1.1 (set up working directory), 1.2 (load compact dataframe) and 2.1 (provide operator credentials).

In addition, datasetId is set as dataframe index to reliably identify the duplicates later on:

python3/Jupyter

df.set_index('datasetId', inplace = True)
df['datasetId']=df.index

3.2 Load duplicate decisions from csv

In order to load the duplicate decisions from csv, the operator initials and the date need to be specified, to match the desired decision output file.

Accordingly, the decision output file is loaded from data/DBNAME/dup_detection/dup_decisions_DBNAME_INITIALS_DATE.csv:

python3/Jupyter

filename      = f'data/{df.name}/dup_detection/dup_decisions_{df.name}_{initials}_{date}'
data, header  = dup.read_csv(filename, header=True)
df_decisions  = pd.read_csv(filename+'.csv', header=5)

dup.read_csv reads the header, which provides the operator's details as saved in the decision file, along with any comments on the general decision process. Later in the notebook, header is written into a metadata file which should be provided alongside the duplicate free dataset. df_decisions is a pandas dataframe which is populated with the decision data, record by record, and will be used to implement the decisions to create a duplicate free dataset.
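
The header=5 argument implies that the decision table starts after five leading lines. The sketch below separates such leading lines from the table using only pandas and the standard library. The file content is made up for illustration, and the real layout written by the decision notebook may differ.

```python
import io
import pandas as pd

N_HEADER_LINES = 5  # matches header=5 in the pd.read_csv call above

# Hypothetical file content: five free-text lines followed by the decision table.
text = "\n".join([
    "operator initials: FN",
    "operator name: Full Name",
    "operator email: name@email.ac.uk",
    "date: not specified in this example",
    "comment: example only",
    "datasetId 1,datasetId 2,Decision 1,Decision 2",
    "rec_a,rec_b,REMOVE,KEEP",
])

# The leading lines carry the operator information.
header_lines = text.splitlines()[:N_HEADER_LINES]

# pandas uses line 5 (0-based) as the column names and ignores the lines before it.
table = pd.read_csv(io.StringIO(text), header=N_HEADER_LINES)

print(header_lines)
print(table)
```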

3.3 Implement duplicate decisions

From df_decisions we extract a dictionary which includes all decisions for each individual record (instead of pairwise decisions as in df_decisions):

python3/Jupyter

# Collect decisions for each record
decisions = dup.collect_record_decisions(df_decisions)

This dictionary can be used to identify and track decisions for multiple duplicates (records which are associated with more than one duplicate candidate pair).

Note

Note that any one record can appear more than once and have multiple decisions associated with it (e.g. 'REMOVE', 'KEEP' or 'COMPOSITE').
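
A minimal sketch of how pairwise decisions can be regrouped by record is given below. It follows the same idea as the loop used later in this notebook, but it is not the code of dup.collect_record_decisions; the function decisions_per_record and the toy IDs are hypothetical.

```python
from collections import defaultdict
import pandas as pd

def decisions_per_record(df_decisions):
    """Map each datasetId to the list of decisions made for it across all pairs."""
    per_record = defaultdict(list)
    for _, row in df_decisions.iterrows():
        per_record[row['datasetId 1']].append(row['Decision 1'])
        per_record[row['datasetId 2']].append(row['Decision 2'])
    return dict(per_record)

# Toy pairwise decisions: record 'a' takes part in two candidate pairs.
toy = pd.DataFrame({
    'datasetId 1': ['a', 'a'],
    'datasetId 2': ['b', 'c'],
    'Decision 1': ['REMOVE', 'COMPOSITE'],
    'Decision 2': ['KEEP', 'COMPOSITE'],
})
per_record = decisions_per_record(toy)

# Records with more than one decision are the multiple duplicates.
multiple = [rec for rec, decs in per_record.items() if len(decs) > 1]
print(per_record)
print('Multiple duplicates:', multiple)
```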

In order to remove the duplicates we therefore implement the following steps:

Remove all records from the dataframe which are associated with the decision REMOVE or COMPOSITE -> df_cleaned
Create composites of the COMPOSITE records -> df_composite
Check for records which have multiple decisions associated. These are potentially remaining duplicates.

We also extract the details of each decision, which will later be used to populate the field duplicateDetails in the final dataframe (the output of this notebook). The details provide information on the nature of the decision (made automatically or manually, i.e. by the operator), as well as the operator's comments.

python3/Jupyter

# Collect duplicate details for each record
dup_details = dup.collect_dup_details(df_decisions, header)

3.3.1. Records to be removed

First, simply remove all the records to which the decision REMOVE or COMPOSITE applies and store the result in df_cleaned, while all REMOVE or COMPOSITE type records are stored in df_duplica (for later inspection).

python3/Jupyter

# load the records TO BE REMOVED OR COMPOSITED
remove_IDs  = list(df_decisions['datasetId 1'][np.isin(df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
remove_IDs += list(df_decisions['datasetId 2'][np.isin(df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
remove_IDs  = np.unique(remove_IDs)

df_duplica =  df.loc[remove_IDs, 'datasetId'] # df containing only records which were removed
df_cleaned =  df.drop(remove_IDs) # df freed from 'REMOVE' type duplicates

print(f'Removed {len(df_duplica)} REMOVE or COMPOSITE type records.')
print(f'REMOVE type duplicate free dataset contains {len(df_cleaned)} records.')
print('Removed the following IDs:', remove_IDs)

df_cleaned then contains all data apart from records which are marked as REMOVE or COMPOSITE. Thus, it only keeps the records which either were never marked as duplicates or where the operator had decided to keep a duplicate.

Note that the duplicateDetails need to be added to df_cleaned via

python3/Jupyter

df_cleaned['duplicateDetails']='N/A'
for ID in dup_details:
    if ID in df_cleaned.index: 
        if df_cleaned.at[ID, 'duplicateDetails']=='N/A': 
            df_cleaned.at[ID, 'duplicateDetails']=dup_details[ID]
        else: df_cleaned.at[ID, 'duplicateDetails']+=dup_details[ID]

3.3.2. Records to be composited

Now identify all the records to which the decision COMPOSITE applies, create composites and store them in df_composite. Where the numerical metadata differ (e.g. geo_meanLat, geo_meanLon, ...), we use the average, while for string types we merge the strings to form a composite. The datasetId is created from both original values as f'{df.name}_composite_z_{ID_1}_{ID_2}', with ID_1 and ID_2 the original datasetId of each record. The data are composited by averaging the z-scores of the original data.

python3/Jupyter

# add the column 'duplicateDetails' to df, in case it does not exist
if 'duplicateDetails' not in df.columns: df['duplicateDetails']='N/A'

# load the records to be composited
comp_ID_pairs = df_decisions[(df_decisions['Decision 1']=='COMPOSITE')&(df_decisions['Decision 2']=='COMPOSITE')]

# create new composite data and metadata from the pairs
# loop through the composite pairs and check metadata
df_composite = dup.join_composites_metadata(df, comp_ID_pairs, df_decisions, header)

The function join_composites_metadata also creates summary figures of the composites in order to supervise the composition process.
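
The z-score averaging can be sketched as follows. This is an illustration rather than the code of join_composites_metadata: the function composite_z is hypothetical, and computing the z-scores over the overlapping time steps only is an assumption.

```python
import numpy as np

def zscore(x):
    """Standardise a series to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def composite_z(year1, vals1, year2, vals2):
    """Average the z-scores of two records over their overlapping time steps."""
    # Only time steps present in both records are composited.
    common, i1, i2 = np.intersect1d(year1, year2, return_indices=True)
    z1 = zscore(np.asarray(vals1, dtype=float)[i1])
    z2 = zscore(np.asarray(vals2, dtype=float)[i2])
    return common, (z1 + z2) / 2

# Toy records with different units but the same shape.
years, comp = composite_z([2000, 2001, 2002, 2003], [1, 2, 3, 4],
                          [2001, 2002, 2003, 2004], [10, 20, 30, 40])
print(years)
print(comp)
```

Because both toy records have the same shape over the shared years, their z-scores coincide and the composite equals either of them.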

3.3.3. Check for multiple duplicate records with different decisions

In order to obtain the duplicate free dataframe we merge df_cleaned and df_composite:

python3/Jupyter

tmp_df_dupfree = pd.concat([df_cleaned, df_composite])
tmp_df_dupfree.index = tmp_df_dupfree['datasetId']
tmp_decisions = decisions.copy()

This dataframe initiates a loop in which the records which are associated with multiple decisions are fed into another round of duplicate detection, decisions and removal. This is necessary to ensure that no duplicates remain in the merged dataframe because of combined decisions.

Example

REMOVE/KEEP and COMPOSITE:
- duplicate pair a and b have had the decisions assigned: a → REMOVE, b → KEEP
- duplicate pair a and c have had the decisions assigned: a → COMPOSITE, c → COMPOSITE
- In this case, b and ac (the composite record of a and c) would be duplicates in the merged dataframe
REMOVE/KEEP & REMOVE/KEEP
- duplicate pair a and b have had the decisions assigned: a → REMOVE, b → KEEP
- duplicate pair a and c have had the decisions assigned: a → REMOVE, c → KEEP
- In this case, a would be removed, but b and c will be kept and would be duplicates in the merged dataframe
COMPOSITE × 2
- duplicate pair a and b have had the decisions assigned: a → COMPOSITE, b → COMPOSITE
- duplicate pair a and c have had the decisions assigned: a → COMPOSITE, c → COMPOSITE
- In this case, ab and ac would be duplicates in the merged dataframe

The loop runs for a maximum of ten iterations, but stops as soon as no more duplicates are detected in the dataframe subset. Note that this loop only checks among the records associated with more than one decision. In each iteration, the operator also has the opportunity to end the duplicate search. Note also that it is not advised to create multiple iterations of composites.

python3/Jupyter

# Simple composite tracking for debugging only
composite_log = []

for ii in range(10): 
    tmp_df_dupfree.set_index('datasetId', inplace = True)
    tmp_df_dupfree['datasetId']=tmp_df_dupfree.index

    print('-'*20)
    print(f'ITERATION # {ii}')

    multiple_dups = []
    for id in tmp_decisions.keys():
        if len(tmp_decisions[id]) > 1:
            if id not in multiple_dups:
                multiple_dups.append(id)

    if len(multiple_dups) > 0:
        # Check which of the multiple duplicate IDs are still in the dataframe
        multiple_dups_new = []
        current_ids = set(tmp_df_dupfree.index)  # Get all current IDs as a set

        for id in multiple_dups:
            if id in current_ids:  # Simple membership check
                multiple_dups_new.append(id)

        if len(multiple_dups_new) > 0:
            print(f'WARNING! Decisions associated with {len(multiple_dups_new)} multiple duplicates in the new dataframe.')
            print('Please review these records below and run through a further duplicate detection workflow until no more duplicates are found.')
        else:
            print('No more multiple duplicates found in current dataframe.')
            print('SUCCESS!!')
            break
    else:
        print('No more multiple duplicates.')
        print('SUCCESS!!')
        break

    # Now we create a small dataframe which needs to be checked for duplicates.
    df_check = tmp_df_dupfree.copy()[np.isin(tmp_df_dupfree['datasetId'], multiple_dups_new)]
    print('Check dataframe: ')
    df_check.name = 'tmp'
    df_check.index = range(len(df_check))
    print(df_check.info())
    # We then run a brief duplicate detection algorithm on the dataframe. Note that by default the composited data has the highest value in the hierarchy.
    pot_dup_IDs = dup.find_duplicates_optimized(df_check, n_points_thresh=10, return_data=True)
    if len(pot_dup_IDs)==0:
        print('SUCCESS!! NO MORE DUPLICATES DETECTED!!')
        break
    else:
        yn=''
        while yn not in ['y', 'n']:
            yn = input('Do you want to continue with the decision process for duplicates? [y/n]')
        if yn=='n': break

    df_check = dup.define_hierarchy(df_check)
    dup.duplicate_decisions_multiple(df_check, operator_details=operator_details, choose_recollection=True, 
                            remove_identicals=False, backup=False, comment=False)
    # implement the decisions
    tmp_df_decisions  = pd.read_csv(f'data/{df_check.name}/dup_detection/dup_decisions_{df_check.name}_{initials}_{date}'+'.csv', header=5)
    tmp_dup_details   = dup.provide_dup_details(tmp_df_decisions, header)


    # decisions
    tmp_decisions = {}
    for ind in tmp_df_decisions.index:
        id1, id2   = tmp_df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
        dec1, dec2 = tmp_df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
        for id, dec in zip([id1, id2], [dec1, dec2]):
            if id not in tmp_decisions: tmp_decisions[id] = []
            tmp_decisions[id]+=[dec]

    df_check.set_index('datasetId', inplace = True)
    df_check['datasetId']=df_check.index

    #drop all REMOVE or COMPOSITE types
    tmp_remove_IDs  = list(tmp_df_decisions['datasetId 1'][np.isin(tmp_df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
    tmp_remove_IDs += list(tmp_df_decisions['datasetId 2'][np.isin(tmp_df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
    tmp_remove_IDs = np.unique(tmp_remove_IDs)#[id for id in np.unique(tmp_remove_IDs) if id not in tmp_remove_IDs]
    tmp_df_cleaned = tmp_df_dupfree.drop(tmp_remove_IDs) # df freed from 'REMOVE' type duplicates

    # # composite the 
    tmp_comp_ID_pairs = tmp_df_decisions[(tmp_df_decisions['Decision 1']=='COMPOSITE')&(tmp_df_decisions['Decision 2']=='COMPOSITE')]

    if len(tmp_comp_ID_pairs) > 0:
        for _, pair in tmp_comp_ID_pairs.iterrows():
            id1, id2 = pair['datasetId 1'], pair['datasetId 2']
            # Log what was composited
            composite_log.append({
                'iteration': ii,
                'composited': [id1, id2],
                'new_id': f"{id1}_{id2}_composite"  # or however you generate it
            })
    # # create new composite data and metadata from the pairs
    # # loop through the composite pairs and check metadata
    tmp_df_composite = dup.join_composites_metadata(df_check, tmp_comp_ID_pairs, tmp_df_decisions, header)

    tmp_df_dupfree = pd.concat([tmp_df_cleaned, tmp_df_composite])
    print('--'*20)
    print('Finished iteration.')

    print('NEW DATAFRAME:')
    print(tmp_df_dupfree.info())

    print('--'*20)
    print('--'*20)
    if ii==9: print('STILL DUPLICATES PRESENT AFTER MULTIPLE ITERATIONS! REVISE DECISION PROCESS!!')  # last pass of range(10)

    print('--'*20)

print(f"Created {len(composite_log)} composites across all iterations")

As soon as no more duplicates are detected among the remaining candidates, the loop outputs:

No more multiple duplicates.
SUCCESS!!

3.4 Check entire dataframe for remaining duplicates

In order to check that all duplicates have definitely been removed from the dataframe, we run another round of duplicate detection, decisions and removal, using a similar workflow as in the previous step:

python3/Jupyter

tmp_df_dupfree.set_index('datasetId', inplace = True)
tmp_df_dupfree['datasetId']=tmp_df_dupfree.index

# Now we create a  dataframe which needs to be checked for duplicates.
df_check = tmp_df_dupfree.copy()
df_check.name = 'tmp'
df_check.index = range(len(df_check))
# We then run a brief duplicate detection algorithm on the dataframe. Note that by default the composited data has the highest value in the hierarchy.
pot_dup_IDs = dup.find_duplicates_optimized(df_check, n_points_thresh=10, return_data=True)
if len(pot_dup_IDs)==0:
    print('SUCCESS!! NO MORE DUPLICATES DETECTED!!')
else:
    df_check = dup.define_hierarchy(df_check)
    dup.duplicate_decisions_multiple(df_check, operator_details=operator_details, choose_recollection=True, 
                            remove_identicals=False, backup=False)
    # implement the decisions
    tmp_df_decisions  = pd.read_csv(f'data/{df_check.name}/dup_detection/dup_decisions_{df_check.name}_{initials}_{date}'+'.csv', header=5)
    tmp_dup_details   = dup.provide_dup_details(tmp_df_decisions, header)


    # decisions
    tmp_decisions = {}
    for ind in tmp_df_decisions.index:
        id1, id2   = tmp_df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
        dec1, dec2 = tmp_df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
        for id, dec in zip([id1, id2], [dec1, dec2]):
            if id not in tmp_decisions: tmp_decisions[id] = []
            tmp_decisions[id]+=[dec]

    df_check.set_index('datasetId', inplace = True)
    df_check['datasetId']=df_check.index

    #drop all REMOVE or COMPOSITE types
    tmp_remove_IDs  = list(tmp_df_decisions['datasetId 1'][np.isin(tmp_df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
    tmp_remove_IDs += list(tmp_df_decisions['datasetId 2'][np.isin(tmp_df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
    tmp_remove_IDs = np.unique(tmp_remove_IDs)#[id for id in np.unique(tmp_remove_IDs) if id not in tmp_remove_IDs]
    tmp_df_cleaned = tmp_df_dupfree.drop(tmp_remove_IDs) # df freed from 'REMOVE' type duplicates

    # # composite the 
    tmp_comp_ID_pairs = tmp_df_decisions[(tmp_df_decisions['Decision 1']=='COMPOSITE')&(tmp_df_decisions['Decision 2']=='COMPOSITE')]

    # # create new composite data and metadata from the pairs
    # # loop through the composite pairs and check metadata
    tmp_df_composite = dup.join_composites_metadata(df_check, tmp_comp_ID_pairs, tmp_df_decisions, header)

    tmp_df_dupfree = pd.concat([tmp_df_cleaned, tmp_df_composite])

    print('Finished last round of duplicate removal.')
    print('Potentially run through this cell again to check for remaining duplicates.')

Warning

This step runs an entire duplicate detection and thus can take a substantial amount of time, as previously. Alternatively, you can skip this step, output the dataframe and feed it back into dup_detection.ipynb and repeat the duplicate workflow.

3.5 Save duplicate free dataframe

Once the operator is satisfied that no more duplicates remain, the final dataframe can be created

python3/Jupyter

df_dupfree = tmp_df_dupfree
print(df_dupfree.info())

and saved via

python3/Jupyter

import os  # needed for os.makedirs below

df_dupfree = df_dupfree[sorted(df_dupfree.columns)]
df_dupfree.name =f'{df.name}_{initials}_{date}_dupfree'
os.makedirs(f'data/{df_dupfree.name}/', exist_ok=True)


utf.write_compact_dataframe_to_csv(df_dupfree)

In order to provide the associated operator's information (such as details, date of creation and operator's comments), we also create the README file:

python3/Jupyter

# write header with operator information as README txt file
with open(f'data/{df_dupfree.name}/{df_dupfree.name}_dupfree_README.txt', 'w') as file:
    for line in header:
        file.write(line+'\n')

Workflow Complete

The duplicate detection workflow is now finished!

Info

For more details on the interactive notebooks, see:

1. dup_detection.ipynb
2. dup_decision.ipynb
3. dup_removal.ipynb
