
Data Analysis Functions

Module containing functions for analysing, filtering, transforming, and processing climate data (written for use in e.g. analysis_T.ipynb, analysis_M.ipynb, analysis_MT.ipynb).

dod2k_utilities.ut_analysis

Author: Lucie Luecke (includes functions by Feng Zhu)

Provides functions for filtering, homogenising, manipulating and analysing data(frames).

Last updated 19/12/2025 for publication of dod2k v2.0.

PCA(covariance)

Perform Principal Component Analysis using singular value decomposition (SVD).

Parameters:

- covariance (ndarray, required): Covariance matrix of shape (n_records, n_records).

Returns:

- eigenvalues (ndarray): Eigenvalues from SVD (singular values), sorted in descending order as returned by numpy's SVD implementation.
- eigenvectors (ndarray): Right singular vectors (Vh) corresponding to the eigenvectors of the covariance matrix.

Notes

This function performs PCA on paleoclimate records. The SVD decomposition yields:

- U: left singular vectors (not returned)
- s: singular values (returned as eigenvalues)
- Vh: right singular vectors (returned as eigenvectors)

The eigenvectors can be used to project the data onto principal components. The eigenvalues indicate the variance explained by each component.

Source code in dod2k_utilities/ut_analysis.py
def PCA(covariance):
    """
    Perform Principal Component Analysis using singular value decomposition (SVD).

    Parameters
    ----------
    covariance : numpy.ndarray
        Covariance matrix of shape (n_records, n_records).

    Returns
    -------
    eigenvalues : numpy.ndarray
        Eigenvalues from SVD (singular values), sorted in descending order
        as returned by numpy's SVD implementation.
    eigenvectors : numpy.ndarray
        Right singular vectors (Vh) corresponding to the eigenvectors of
        the covariance matrix.

    Notes
    -----
    This function performs PCA on paleoclimate records. The SVD decomposition yields:
    - U: left singular vectors (not returned)
    - s: singular values (returned as eigenvalues)
    - Vh: right singular vectors (returned as eigenvectors)

    The eigenvectors can be used to project the data onto principal components.
    The eigenvalues indicate the variance explained by each component.
    """
    U, s, Vh = np.linalg.svd(covariance) # s eigenvalues, U, Vh rotation matrices

    eigenvalues  = s
    eigenvectors = Vh


    return eigenvalues, eigenvectors
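A minimal sketch of what PCA(covariance) computes, using a made-up 3-record covariance matrix (the toy data here is purely illustrative):

```python
import numpy as np

# Toy data: 3 records of 100 time steps each, then their covariance matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100))
cov = np.cov(X)                        # shape (n_records, n_records)

U, s, Vh = np.linalg.svd(cov)          # what PCA(cov) computes internally
eigenvalues, eigenvectors = s, Vh

# For a symmetric positive semi-definite matrix the singular values equal
# the eigenvalues, so they sum to the trace (the total variance).
print(np.allclose(eigenvalues.sum(), np.trace(cov)))  # True
```

Because a covariance matrix is symmetric and positive semi-definite, the singular values returned here coincide with its eigenvalues, already sorted in descending order by numpy.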

add_auxvars_plot_summary(df_filtered, key, mincount=0, col='k', **kwargs)

Add auxiliary variables to the DataFrame and generate summary plots.

Adds 'length', 'miny', and 'maxy' columns, then plots coverage, resolution, and length distributions.

Parameters:

- df_filtered (DataFrame, required): DataFrame containing 'year' and 'paleoData_values'.
- key (str, required): Title for plots and archiveType label.
- mincount (int, default 0): Minimum count threshold for plotting resolution and length.
- col (str, default 'k'): Color for plotting ('k' is black).
- **kwargs (dict): Additional keyword arguments passed to plot_coverage.

Returns:

- DataFrame: The DataFrame with added auxiliary columns ('length', 'miny', 'maxy').

Notes

This function modifies the DataFrame in place by adding new columns and also calls add_resolution_to_df to compute resolution.

Source code in dod2k_utilities/ut_analysis.py
def add_auxvars_plot_summary(df_filtered, key, mincount=0, col='k', **kwargs):
    """
    Add auxiliary variables to the DataFrame and generate summary plots.

    Adds 'length', 'miny', and 'maxy' columns, then plots coverage, resolution,
    and length distributions.

    Parameters
    ----------
    df_filtered : pandas.DataFrame
        DataFrame containing 'year' and 'paleoData_values'.
    key : str
        Title for plots and archiveType label.
    mincount : int, optional
        Minimum count threshold for plotting resolution and length. Default is 0.
    col : str, optional
        Color for plotting. Default is 'k' (black).
    **kwargs : dict
        Additional keyword arguments passed to `plot_coverage`.

    Returns
    -------
    pandas.DataFrame
        The DataFrame with added auxiliary columns ('length', 'miny', 'maxy').

    Notes
    -----
    This function modifies the DataFrame in place by adding new columns and
    also calls `add_resolution_to_df` to compute resolution.
    """

    # add 'length, miny, maxy' to dataframe
    df_filtered['length'] = df_filtered.paleoData_values.apply(len)
    df_filtered['miny']   = df_filtered.year.apply(np.min)
    df_filtered['maxy']   = df_filtered.year.apply(np.max)

    add_resolution_to_df(df_filtered, print_output=True)

    df_tmp = df_filtered.copy()
    df_tmp['archiveType']=key

    # years    = np.arange(min(df.miny), max(df.maxy)+1)
    uplt.plot_coverage(df_tmp, [key], [key], [], {key: col}, **kwargs)

    uplt.plot_resolution(df_tmp, key, mincount=mincount, col=col)
    uplt.plot_length(df_tmp, key, mincount=mincount, col=col)

    return df_filtered
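The auxiliary columns this function adds can be illustrated without the plotting calls; the two-record DataFrame below is a made-up example following the module's column conventions:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with the module's 'year' / 'paleoData_values' layout.
df = pd.DataFrame({
    'year': [np.array([1900., 1901., 1902.]), np.array([1850., 1950.])],
    'paleoData_values': [np.array([0.1, 0.2, 0.3]), np.array([1.0, 2.0])],
})

# The three auxiliary columns: record length and first/last year.
df['length'] = df.paleoData_values.apply(len)
df['miny']   = df.year.apply(np.min)
df['maxy']   = df.year.apply(np.max)

print(df[['length', 'miny', 'maxy']])
```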

add_resolution_to_df(df, print_output=False, plot_output=False)

Compute the time resolution of each record and store in the DataFrame.

Sorts the time and data values, then calculates the resolution as the unique differences between consecutive years.

Parameters:

- df (DataFrame, required): DataFrame containing 'year' and 'paleoData_values'.
- print_output (bool, default False): If True, prints debug information.
- plot_output (bool, default False): Currently unused; reserved for future plotting.

Returns:

- None: The function updates the 'resolution' column of the DataFrame in place.

Source code in dod2k_utilities/ut_analysis.py
def add_resolution_to_df(df, print_output=False, plot_output=False):
    """
    Compute the time resolution of each record and store in the DataFrame.

    Sorts the time and data values, then calculates the resolution as the 
    unique differences between consecutive years.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing 'year' and 'paleoData_values'.
    print_output : bool, optional
        If True, prints debug information. Default is False.
    plot_output : bool, optional
        Currently unused; reserved for future plotting. Default is False.

    Returns
    -------
    None
        The function updates the 'resolution' column of the DataFrame in place.
    """


    # sort year and data values and obtain resolution
    df['paleoData_values']= df.apply(lambda x: x.paleoData_values[np.argsort(x.year)], axis=1)
    df['year']= df.apply(lambda x: np.round(x.year[np.argsort(x.year)], 2), axis=1)
    df['resolution']= df.year.apply(np.diff).apply(np.unique)


    return 
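The sort-then-diff logic can be seen on a single toy record (the DataFrame below is illustrative, following the module's column names):

```python
import numpy as np
import pandas as pd

# One record with unsorted years and matching data values.
df = pd.DataFrame({
    'year': [np.array([2000., 2002., 2001., 2005.])],
    'paleoData_values': [np.array([1.0, 2.0, 3.0, 4.0])],
})

# Sort data by year, then resolution = unique gaps between consecutive years.
df['paleoData_values'] = df.apply(lambda x: x.paleoData_values[np.argsort(x.year)], axis=1)
df['year'] = df.apply(lambda x: np.round(x.year[np.argsort(x.year)], 2), axis=1)
df['resolution'] = df.year.apply(np.diff).apply(np.unique)

print(df.at[0, 'resolution'])   # [1. 3.]
```

A record with multiple values in 'resolution' therefore has an uneven time axis, which is what filter_resolution tests against.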

add_zscores_plot(df, key, plot_output=True)

Add z-scores of paleoData_values to the DataFrame.

This function calculates the z-score for each record and adds a new column 'paleoData_zscores'. Optionally plots the original values and z-scores.

Parameters:

- df (DataFrame, required): DataFrame containing 'year' and 'paleoData_values'.
- key (str, required): Title for the plot.
- plot_output (bool, default True): If True, generates a diagnostic plot.

Returns:

- DataFrame: The DataFrame with an added 'paleoData_zscores' column.

Source code in dod2k_utilities/ut_analysis.py
def add_zscores_plot(df, key, plot_output=True):
    """
    Add z-scores of paleoData_values to the DataFrame.

    This function calculates the z-score for each record and adds a new column
    'paleoData_zscores'. Optionally plots the original values and z-scores.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing 'year' and 'paleoData_values'.
    key : str
        Title for the plot.
    plot_output : bool, optional
        If True, generates a diagnostic plot. Default is True.

    Returns
    -------
    pandas.DataFrame
        The DataFrame with an added 'paleoData_zscores' column.
    """

    df['paleoData_zscores'] = df.apply(lambda x: calc_z_score(x), axis=1)


    # plot paleoData_values and paleoData_zscores
    fig = plt.figure()
    plt.suptitle(key)
    plt.subplot(211)
    plt.ylabel('paleoData_values')#, y=0.85)
    for ii in df.index:
        plt.plot(df.at[ii, 'year'],
                 df.at[ii, 'paleoData_values'], lw=1)
    plt.xticks([])
    plt.subplot(212)
    plt.ylabel('paleoData_zscores')#, y=0.85)
    plt.xlabel('year CE')
    for ii in df.index:
        plt.plot(df.at[ii, 'year'],
                 df.at[ii, 'paleoData_zscores'], lw=1)
    fig.tight_layout()
    return df

calc_covariance_matrix(df)

Compute the covariance matrix and overlapping years for all records.

Parameters:

- df (DataFrame, required): DataFrame containing homogenised z-score arrays (paleoData_zscores_hom_avbl) and their corresponding available years (year_hom_avbl).

Returns:

- covariance (ndarray): Covariance matrix of shape (n_records, n_records) between all records.
- overlap (ndarray): Matrix of shape (n_records, n_records) containing the number of overlapping years between each pair of records.

Source code in dod2k_utilities/ut_analysis.py
def calc_covariance_matrix(df):
    """
    Compute the covariance matrix and overlapping years for all records.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing homogenised z-score arrays (`paleoData_zscores_hom_avbl`) 
        and their corresponding available years (`year_hom_avbl`).

    Returns
    -------
    covariance : numpy.ndarray
        Covariance matrix of shape (n_records, n_records) between all records.
    overlap : numpy.ndarray
        Matrix of shape (n_records, n_records) containing the number of overlapping 
        years between each pair of records.
    """
    n_recs = len(df)

    covariance = np.zeros([n_recs, n_recs])
    overlap    = np.zeros([n_recs, n_recs])
    for ii in range(n_recs):
        for jj in range(ii, n_recs):
            # print(ii, jj)

            time_1 = df.iloc[ii].year_hom_avbl
            time_2 = df.iloc[jj].year_hom_avbl
            time_12, int_1, int_2 = np.intersect1d(time_1, time_2, return_indices=True) # saves intersect between the records

            overlap[ii, jj] = len(time_12)
            overlap[jj, ii] = len(time_12)

            data_1 = df.iloc[ii].paleoData_zscores_hom_avbl
            data_1 -= np.mean(data_1)
            # data_1 /= np.std(data_1)
            data_2 = df.iloc[jj].paleoData_zscores_hom_avbl
            data_2 -= np.mean(data_2)
            # data_2 /= np.std(data_2)
            covariance[ii, jj] = np.cov(data_1[int_1], data_2[int_2], bias=False)[0,1]      #Default normalization (False) is by (N - 1), where N is the number of observations given (unbiased estimate). If bias is True, then normalization is by N.

            covariance[jj, ii] =covariance[ii, jj]

    print('short records : ', overlap[overlap<40])
    return covariance, overlap
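The per-pair step (intersect the time axes, count the overlap, take the covariance on the shared years) can be sketched for two made-up records with different coverage:

```python
import numpy as np

# Two illustrative records on different year axes.
year_1 = np.arange(1900, 2001)
year_2 = np.arange(1950, 2011)
data_1 = np.sin(0.1 * year_1)
data_2 = np.sin(0.1 * year_2) + 0.1

# intersect1d returns the shared years plus indices into each record.
shared, idx_1, idx_2 = np.intersect1d(year_1, year_2, return_indices=True)
overlap = len(shared)                                   # years in common
cov_12 = np.cov(data_1[idx_1], data_2[idx_2], bias=False)[0, 1]

print(overlap)   # 51
```

In the function this is done for every (ii, jj) pair, and both the covariance and overlap matrices are filled symmetrically.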

calc_z_score(x)

Calculate the z-score of paleoData_values for a single record.

Parameters:

- x (Series, required): Series representing a single record with a 'paleoData_values' array.

Returns:

- ndarray: Z-scored values of the record.

Source code in dod2k_utilities/ut_analysis.py
def calc_z_score(x):
    """
    Calculate the z-score of paleoData_values for a single record.

    Parameters
    ----------
    x : pandas.Series
        Series representing a single record with a 'paleoData_values' array.

    Returns
    -------
    numpy.ndarray
        Z-scored values of the record.
    """
    # calculate z-score 
    z = x.paleoData_values-np.mean(x.paleoData_values)
    z /= np.std(x.paleoData_values)
    return z
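The same arithmetic on a plain array, for reference (the values are made up):

```python
import numpy as np

# Z-scoring as in calc_z_score: subtract the mean, divide by the
# population standard deviation (np.std default, ddof=0).
values = np.array([2.0, 4.0, 6.0, 8.0])
z = (values - np.mean(values)) / np.std(values)

print(np.round(z, 3))   # [-1.342 -0.447  0.447  1.342]
```

By construction the result has mean 0 and standard deviation 1, which is what makes records with different units comparable downstream.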

convert_subannual_to_annual_res(df)

Convert sub-annual data to annual averages.

For each record in the DataFrame, this function computes yearly averages of the 'paleoData_values' and replaces the original 'year' array with integer years.

Parameters:

- df (DataFrame, required): DataFrame containing 'year' and 'paleoData_values' columns, where 'year' may contain sub-annual (fractional) time values.

Returns:

- None: The function modifies the DataFrame in place.

Notes

For each record:

- Years are floored to get integer year values
- Data values are averaged within each integer year
- Both 'year' and 'paleoData_values' columns are replaced with annual values

Source code in dod2k_utilities/ut_analysis.py
def convert_subannual_to_annual_res(df):
    """
    Convert sub-annual data to annual averages.

    For each record in the DataFrame, this function computes yearly averages
    of the 'paleoData_values' and replaces the original 'year' array with 
    integer years.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing 'year' and 'paleoData_values' columns, where 'year'
        may contain sub-annual (fractional) time values.

    Returns
    -------
    None
        The function modifies the DataFrame in place.

    Notes
    -----

    For each record:
    - Years are floored to get integer year values
    - Data values are averaged within each integer year
    - Both 'year' and 'paleoData_values' columns are replaced with annual values
    """
    for ii in df.index:
        year = df.at[ii, 'year']
        sy = np.min(year)
        year_ar = np.unique(np.floor(year))
        # print(year_ar)
        data_ar = []
        for yy in year_ar:
            mask = np.floor(year)==yy
            # print(df.at[ii, 'year'][mask])
            data_ar+=[np.mean(df.at[ii, 'paleoData_values'][mask])]
        # print(data_ar)
        df.at[ii, 'paleoData_values'] = np.array(data_ar)
        df.at[ii, 'year'] = year_ar
    return 
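The floor-and-average step for a single record, on made-up quarterly data:

```python
import numpy as np

# One record with sub-annual (quarterly) time steps.
year = np.array([2000.0, 2000.25, 2000.5, 2000.75, 2001.0, 2001.5])
values = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 20.0])

# Floor to integer years, then average the values within each year.
year_ar = np.unique(np.floor(year))
data_ar = np.array([values[np.floor(year) == yy].mean() for yy in year_ar])

print(year_ar)   # [2000. 2001.]
print(data_ar)   # [ 2.5 15. ]
```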

filter_data_availability(df, mny, mxy)

Filter records based on data availability within a given year range.

Parameters:

- df (DataFrame, required): DataFrame containing a 'year' column, where each entry is an array-like of years available for the record.
- mny (int, required): Start year of the range.
- mxy (int, required): End year of the range.

Returns:

- DataFrame: A filtered DataFrame containing only records that have data available between mny and mxy.

Notes

Removes records that have no data within the specified year range. Prints the indices of removed records and summary statistics.

Source code in dod2k_utilities/ut_analysis.py
def filter_data_availability(df, mny, mxy):
    """
    Filter records based on data availability within a given year range.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing a 'year' column, where each entry is an array-like
        of years available for the record.
    mny : int
        Start year of the range.
    mxy : int
        End year of the range.

    Returns
    -------
    pandas.DataFrame
        A filtered DataFrame containing only records that have data available
        between `mny` and `mxy`.

    Notes
    -----
    Removes records that have no data within the specified year range.
    Prints the indices of removed records and summary statistics.
    """

    remove = []
    for ii in df.index:
        if np.sum((df.at[ii, 'year']>=mny)&(df.at[ii, 'year']<=mxy))==0:
            # print('No available data', ii)
            remove+=[ii]
    df=df.drop(labels=remove)
    print('No available data: ', remove)
    print('Keep %d records with data available between %d-%d. Exclude %d records.'%(df.shape[0], mny, mxy, len(remove)))

    return df
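The availability test reduces to a boolean mask per record; a toy two-record example (one record inside the window, one outside):

```python
import numpy as np
import pandas as pd

# Record 0 lies inside 1500-1900; record 1 is entirely outside it.
df = pd.DataFrame({'year': [np.arange(1600, 1701), np.arange(1950, 2001)]})

mny, mxy = 1500, 1900
keep = df.year.apply(lambda y: np.sum((y >= mny) & (y <= mxy)) > 0)
df_kept = df[keep]

print(len(df_kept))   # 1
```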

filter_record_length(df, nyears, mny, mxy)

Filter records based on the number of years with data in a given range.

Parameters:

- df (DataFrame, required): DataFrame containing a 'year' column, where each entry is an array-like of years available for the record.
- nyears (int, required): Minimum number of years required within the specified range.
- mny (int, required): Start year of the range.
- mxy (int, required): End year of the range.

Returns:

- DataFrame: A filtered DataFrame containing only records with at least nyears of data between mny and mxy.

Notes

Records are dropped in-place if they don't meet the minimum year requirement. Prints the number of records kept and excluded after filtering.

Source code in dod2k_utilities/ut_analysis.py
def filter_record_length(df, nyears, mny, mxy):
    """
    Filter records based on the number of years with data in a given range.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing a 'year' column, where each entry is an array-like
        of years available for the record.
    nyears : int
        Minimum number of years required within the specified range.
    mny : int
        Start year of the range.
    mxy : int
        End year of the range.

    Returns
    -------
    pandas.DataFrame
        A filtered DataFrame containing only records with at least `nyears`
        of data between `mny` and `mxy`.

    Notes
    -----
    Records are dropped in-place if they don't meet the minimum year requirement.
    Prints the number of records kept and excluded after filtering.
    """

    remove = []
    for ii in df.index:
        if np.sum((df.at[ii, 'year']>=mny)&(df.at[ii, 'year']<=mxy))<nyears:
            # print('No available data', ii)
            remove+=[ii]

    df=df.drop(labels=remove)

    print('Keep %d records with nyears>=%d during %d-%d. Exclude %d records.'%(df.shape[0], nyears, mny, mxy, len(remove)))    
    return df
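Unlike filter_data_availability, this filter counts the years inside the window and compares against a threshold; a made-up two-record example:

```python
import numpy as np
import pandas as pd

# Record 0 has 30 years inside 1700-1800; record 1 has only 6.
df = pd.DataFrame({'year': [np.arange(1700, 1730), np.arange(1795, 1805)]})

nyears, mny, mxy = 20, 1700, 1800
keep = df.year.apply(lambda y: np.sum((y >= mny) & (y <= mxy)) >= nyears)

print(keep.tolist())   # [True, False]
```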

filter_resolution(df, maxres)

Filter records in a DataFrame based on maximum resolution.

Parameters:

- df (DataFrame, required): DataFrame containing a 'resolution' column, where each entry is a list of numeric resolution values.
- maxres (int or float, required): Maximum allowed resolution. Records with all resolution values less than or equal to maxres are kept.

Returns:

- DataFrame: A filtered DataFrame containing only records with resolution <= maxres.

Notes

Prints the number of records kept and excluded after filtering.

Source code in dod2k_utilities/ut_analysis.py
def filter_resolution(df, maxres):
    """
    Filter records in a DataFrame based on maximum resolution.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing a 'resolution' column, where each entry is a list 
        of numeric resolution values.
    maxres : int or float
        Maximum allowed resolution. Records with all resolution values less than
        or equal to `maxres` are kept.

    Returns
    -------
    pandas.DataFrame
        A filtered DataFrame containing only records with resolution <= maxres.

    Notes
    -----
    Prints the number of records kept and excluded after filtering.
    """
    rmask = df.resolution.apply(lambda x: np.all(np.array(x)<=maxres))
    print('Keep %d records with resolution <=%d. Exclude %d records.'%(len(df[rmask]), maxres, len(df[~rmask])))

    return df[rmask]
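The mask logic on a toy DataFrame whose 'resolution' entries mimic the output of add_resolution_to_df:

```python
import numpy as np
import pandas as pd

# Record 0 is strictly annual; record 1 also contains 5-year gaps.
df = pd.DataFrame({'resolution': [np.array([1.0]),
                                  np.array([1.0, 5.0])]})

maxres = 1
rmask = df.resolution.apply(lambda x: np.all(np.array(x) <= maxres))

print(rmask.tolist())   # [True, False]
```

A record is kept only if every gap in its time axis is within maxres, so a single large gap excludes the whole record.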

find_nearest2d(da, lat, lon, lat_name='lat', lon_name='lon', new_dim='sites', r=1)

Find nearest valid grid points in 2D xarray DataArray to given coordinates.

Selects the nearest grid point to specified lat/lon coordinates. If the nearest point is NaN, searches within a radius r for the closest valid point.

Parameters:

- da (DataArray, required): Input DataArray with latitude and longitude dimensions.
- lat (float or array-like, required): Target latitude(s) in decimal degrees.
- lon (float or array-like, required): Target longitude(s) in decimal degrees.
- lat_name (str, default 'lat'): Name of latitude dimension in da.
- lon_name (str, default 'lon'): Name of longitude dimension in da.
- new_dim (str, default 'sites'): Name for new dimension when concatenating results for multiple sites.
- r (float, default 1): Search radius in degrees when nearest point is invalid.

Returns:

- DataArray: DataArray values at nearest valid grid points. If multiple lat/lon pairs are provided, returns concatenated results along new_dim.

Raises:

- ValueError: If no valid values are found within the search radius.

Notes

Author: Feng Zhu

The function uses great circle distance to find the closest valid grid point when the simple nearest neighbor is NaN.

Examples:

>>> # Extract data at single location
>>> temp_site = find_nearest2d(temp_data, 52.52, 13.40)
>>> # Extract data at multiple locations
>>> lats = [40.7, 51.5, 48.8]
>>> lons = [-74.0, -0.1, 2.3]
>>> temps_sites = find_nearest2d(temp_data, lats, lons, r=2)
Source code in dod2k_utilities/ut_analysis.py
def find_nearest2d(da:xr.DataArray, lat, lon, lat_name='lat', lon_name='lon', new_dim='sites', r=1):
    """
    Find nearest valid grid points in 2D xarray DataArray to given coordinates.

    Selects the nearest grid point to specified lat/lon coordinates. If the
    nearest point is NaN, searches within a radius r for the closest valid point.

    Parameters
    ----------
    da : xr.DataArray
        Input DataArray with latitude and longitude dimensions.
    lat : float or array-like
        Target latitude(s) in decimal degrees.
    lon : float or array-like
        Target longitude(s) in decimal degrees.
    lat_name : str, optional
        Name of latitude dimension in da. Default is 'lat'.
    lon_name : str, optional
        Name of longitude dimension in da. Default is 'lon'.
    new_dim : str, optional
        Name for new dimension when concatenating results for multiple sites.
        Default is 'sites'.
    r : float, optional
        Search radius in degrees when nearest point is invalid. Default is 1.

    Returns
    -------
    xr.DataArray
        DataArray values at nearest valid grid points. If multiple lat/lon pairs
        provided, returns concatenated results along new_dim.

    Raises
    ------
    ValueError
        If no valid values found within search radius.

    Notes
    -----
    Author: Feng Zhu

    The function uses great circle distance to find the closest valid grid point
    when the simple nearest neighbor is NaN.

    Examples
    --------
    >>> # Extract data at single location
    >>> temp_site = find_nearest2d(temp_data, 52.52, 13.40)

    >>> # Extract data at multiple locations
    >>> lats = [40.7, 51.5, 48.8]
    >>> lons = [-74.0, -0.1, 2.3]
    >>> temps_sites = find_nearest2d(temp_data, lats, lons, r=2)
    """
    da_res = da.sel({lat_name: lat, lon_name:lon}, method='nearest')
    if da_res.isnull().any():
        if isinstance(lat, (int, float)): lat = [lat]
        if isinstance(lon, (int, float)): lon = [lon]
        da_res_list = []
        for la, lo in zip(lat, lon):
            mask_lat = (da.lat > la-r)&(da.lat < la+r)
            mask_lon = (da.lon > lo-r)&(da.lon < lo+r)
            # da_sub = da.sel({lat_name: slice(la-r, la+r), lon_name: slice(lo-r, lo+r)})
            da_sub = da.sel({'lat': mask_lat, 'lon': mask_lon})
            dist = gcd(da_sub[lat_name], da_sub[lon_name], la, lo)
            da_sub_valid = da_sub.where(~np.isnan(da_sub), drop=True)
            valid_mask = ~np.isnan(da_sub_valid)
            if valid_mask.sum() == 0:
                print('la:', la)
                print('lo:', lo)
                print('la+r:', la+r)
                print('la-r:', la-r)
                print('lo+r:', lo+r)
                print('lo-r:', lo-r)
                print(da_sub)
                raise ValueError('No valid values found. Please try larger `r` values.')

            dist_min = dist.where(dist == dist.where(~np.isnan(da_sub_valid)).min(), drop=True)
            nearest_lat = dist_min[lat_name].values.item()
            nearest_lon = dist_min[lon_name].values.item()
            da_res = da_sub_valid.sel({lat_name: nearest_lat, lon_name: nearest_lon}, method='nearest')
            da_res_list.append(da_res)
        da_res = xr.concat(da_res_list, dim=new_dim).squeeze()

    return da_res
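The fallback search (pick the nearest non-NaN cell by great-circle distance) can be sketched with plain numpy arrays instead of xarray; the 3x3 grid, coordinates, and target site below are all made up:

```python
import numpy as np

# Haversine distance, matching the formula used by gcd in this module.
def haversine(lat1, lon1, lat2, lon2, radius=6378.137):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return radius * 2 * np.arcsin(np.sqrt(a))

lats, lons = np.array([50., 51., 52.]), np.array([10., 11., 12.])
grid = np.array([[np.nan, 1.0, 2.0],
                 [3.0, np.nan, 4.0],
                 [5.0, 6.0, np.nan]])   # NaN = invalid cells
la, lo = 50.6, 10.1                     # target site

# Distance from the target to every grid cell; invalid cells are excluded.
LA, LO = np.meshgrid(lats, lons, indexing='ij')
dist = haversine(LA, LO, la, lo)
dist[np.isnan(grid)] = np.inf
ii, jj = np.unravel_index(np.argmin(dist), dist.shape)

print(grid[ii, jj])   # 3.0 (nearest valid neighbour)
```

find_nearest2d does the same thing on xarray objects, restricted to a window of +/- r degrees around the target.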

find_shared_period(df, minmax=False, time='year', data='paleoData_zscores')

Determine the shared time period across all records.

Parameters:

- df (DataFrame, required): DataFrame containing the records with time and data columns.
- minmax (tuple or list, default False): Year range to use for plotting if no shared period exists.
- time (str, default 'year'): Name of the column containing the time axis.
- data (str, default 'paleoData_zscores'): Name of the data column to plot if no shared period exists.

Returns:

- miny (int or float): Minimum year of the shared period, or np.nan if none.
- maxy (int or float): Maximum year of the shared period, or np.nan if none.

Source code in dod2k_utilities/ut_analysis.py
def find_shared_period(df, minmax=False, time='year', data='paleoData_zscores'):
    """
    Determine the shared time period across all records.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing the records with time and data columns.
    minmax : tuple or list, optional
        Year range to use for plotting if no shared period exists. Default is False.
    time : str, optional
        Name of the column containing the time axis. Default is 'year'.
    data : str, optional
        Name of the data column to plot if no shared period exists. Default is 'paleoData_zscores'.

    Returns
    -------
    miny : int or float
        Minimum year of the shared period, or np.nan if none.
    maxy : int or float
        Maximum year of the shared period, or np.nan if none.
    """
    try:

        miny = np.min(reduce(np.intersect1d, df[time]))
        maxy = np.max(reduce(np.intersect1d, df[time]))
        print('INTERSECT: %d-%d'%(miny, maxy))
    except ValueError:
        print('No shared period across all records.')
        miny = np.nan
        maxy = np.nan
        # plt.figure()
        # for jj, ii in enumerate(df.index):
        #     dd = df.at[ii, data]
        #     yy = df.at[ii, time]
        #     plt.plot(yy, dd-np.mean(dd)+jj)
        #     if minmax: plt.xlim(minmax[0], minmax[-1])
    return miny, maxy
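The core of the function is a chained intersection of all records' year axes; a made-up three-record example:

```python
import numpy as np
from functools import reduce

# Three records with different coverage; reduce chains np.intersect1d
# pairwise over the whole list, leaving only years present in all three.
years = [np.arange(1800, 1951), np.arange(1900, 2001), np.arange(1850, 1961)]
shared = reduce(np.intersect1d, years)

miny, maxy = np.min(shared), np.max(shared)
print(miny, maxy)   # 1900 1950
```

If any pairwise intersection is empty, np.min raises the ValueError that the function catches to report "No shared period across all records."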

fraction_of_explained_var(covariance, eigenvalues, n_recs, title='', db_name='', col='tab:blue')

Compute and plot the fraction of variance explained by principal components.

Parameters:

- covariance (ndarray, required): Covariance matrix of the records.
- eigenvalues (ndarray, required): Eigenvalues from PCA/SVD.
- n_recs (int, required): Number of records.
- title (str, default ''): Title used in the plot.
- db_name (str, default ''): Name suffix for saving the figure.
- col (str, default 'tab:blue'): Color for plotting.

Returns:

- frac_explained_var (ndarray): Fraction of variance explained by each principal component.

Notes

Creates a dual-axis plot showing both individual and cumulative fraction of explained variance. The plot is saved using utf.figsave.

Source code in dod2k_utilities/ut_analysis.py
def fraction_of_explained_var(covariance, eigenvalues, n_recs, title='', db_name='', col='tab:blue'):
    """
    Compute and plot the fraction of variance explained by principal components.

    Parameters
    ----------
    covariance : numpy.ndarray
        Covariance matrix of the records.
    eigenvalues : numpy.ndarray
        Eigenvalues from PCA/SVD.
    n_recs : int
        Number of records.
    title : str, optional
        Title used in the plot. Default is empty string.
    db_name : str, optional
        Name suffix for saving the figure. Default is empty string.
    col : str, optional
        Color for plotting. Default is 'tab:blue'.

    Returns
    -------
    frac_explained_var : numpy.ndarray
        Fraction of variance explained by each principal component.

    Notes
    -----
    Creates a dual-axis plot showing both individual and cumulative fraction
    of explained variance. The plot is saved using `utf.figsave`.
    """
    sorter = np.argsort(eigenvalues)[::-1] # sort eigenvalues in descending order

    explained_var  = eigenvalues[sorter]**2/ (n_recs - 1) 

    total_var = np.sum(explained_var)
    frac_explained_var = explained_var / total_var

    cum_frac_explained_var = np.cumsum(frac_explained_var)

    fig = plt.figure()
    plt.title(title)
    ax = plt.gca()
    plt.plot(np.arange(len(frac_explained_var))+1, frac_explained_var, label='fraction of explained variance', color=col, lw=2)
    plt.xlim(-1, 10)
    plt.ylabel('fraction of explained variance')

    plt.xlabel('PC')

    ax1 = ax.twinx()
    ax1.plot(np.arange(len(frac_explained_var))+1, cum_frac_explained_var, ls=':', 
             label='cumulative fraction of explained variance', color=col, lw=2)

    plt.ylabel('cumulative fraction of explained variance') 

    utf.figsave(fig, 'foev_%s'%title, add=db_name)

    return frac_explained_var
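Setting the plotting aside, the core arithmetic (sort, convert to variances, normalise, accumulate) can be traced on made-up eigenvalues:

```python
import numpy as np

# Three illustrative eigenvalues for n_recs = 3 records.
eigenvalues = np.array([3.0, 1.0, 2.0])
n_recs = 3

# Same steps as in fraction_of_explained_var, minus the figure.
sorter = np.argsort(eigenvalues)[::-1]              # descending order
explained_var = eigenvalues[sorter] ** 2 / (n_recs - 1)
frac_explained_var = explained_var / explained_var.sum()
cum_frac = np.cumsum(frac_explained_var)

print(np.round(frac_explained_var, 3))   # [0.643 0.286 0.071]
```

The fractions always sum to 1, so cum_frac shows how many leading components are needed to reach a given share of the total variance.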

gcd(lat1, lon1, lat2, lon2, radius=6378.137)

Calculate 2D great circle distance between points on Earth.

Parameters:

- lat1 (float or array-like, required): Latitude(s) of first point(s) in decimal degrees.
- lon1 (float or array-like, required): Longitude(s) of first point(s) in decimal degrees.
- lat2 (float or array-like, required): Latitude(s) of second point(s) in decimal degrees.
- lon2 (float or array-like, required): Longitude(s) of second point(s) in decimal degrees.
- radius (float, default 6378.137): Earth radius in kilometers (6378.137 km is the equatorial radius).

Returns:

- float or ndarray: Great circle distance(s) in kilometers.

Notes

Uses the haversine formula for calculating distances on a sphere. Supports vectorized operations for arrays of coordinates.

Examples:

>>> # Distance between London and Paris
>>> gcd(51.5074, -0.1278, 48.8566, 2.3522)
343.5...
>>> # Multiple distances at once
>>> lats1 = [40.7, 51.5, 48.8]
>>> lons1 = [-74.0, -0.1, 2.3]
>>> gcd(lats1, lons1, 52.52, 13.40)  # Distances to Berlin
array([6385.7..., 930.3..., 877.4...])
Source code in dod2k_utilities/ut_analysis.py
def gcd(lat1, lon1, lat2, lon2, radius=6378.137):
    """
    Calculate 2D great circle distance between points on Earth.

    Parameters
    ----------
    lat1 : float or array-like
        Latitude(s) of first point(s) in decimal degrees.
    lon1 : float or array-like
        Longitude(s) of first point(s) in decimal degrees.
    lat2 : float or array-like
        Latitude(s) of second point(s) in decimal degrees.
    lon2 : float or array-like
        Longitude(s) of second point(s) in decimal degrees.
    radius : float, optional
        Earth radius in kilometers. Default is 6378.137 km (equatorial radius).

    Returns
    -------
    float or numpy.ndarray
        Great circle distance(s) in kilometers.

    Notes
    -----
    Uses the haversine formula for calculating distances on a sphere.
    Supports vectorized operations for arrays of coordinates.

    Examples
    --------
    >>> # Distance between London and Paris
    >>> gcd(51.5074, -0.1278, 48.8566, 2.3522)
    343.5...

    >>> # Multiple distances at once
    >>> lats1 = [40.7, 51.5, 48.8]
    >>> lons1 = [-74.0, -0.1, 2.3]
    >>> gcd(lats1, lons1, 52.52, 13.40)  # Distances to Berlin
    array([6385.7..., 930.3..., 877.4...])
    """
    # Convert degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    dist = radius * c
    return dist
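The haversine computation can be sanity-checked with a standalone sketch (the helper `haversine_km` is a hypothetical mirror of `gcd`, not part of the module): identical points give zero distance, and antipodal points give half the circumference.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius=6378.137):
    # mirror of gcd: convert to radians, then apply the haversine formula
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return radius * 2 * np.arcsin(np.sqrt(a))

same = haversine_km(51.5, -0.1, 51.5, -0.1)   # identical points -> 0 km
anti = haversine_km(0.0, 0.0, 0.0, 180.0)     # antipodes -> pi * radius km
```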

homogenise_data_dimensions(df, years_hom, title='', print_output=False, plot_output=True)

Homogenise the data arrays to a uniform time coordinate.

This function assigns paleoData values and z-scores from each record in the DataFrame to a homogenised time axis (years_hom). Steps with no data are set to zero and masked using numpy masked arrays. Optional printing and plotting of intermediate checks are provided.

Parameters:

- df (pandas.DataFrame): DataFrame containing the records with columns 'year' (array-like of years), 'paleoData_values' (array-like of values), and 'paleoData_zscores' (array-like of z-scores). Required.
- years_hom (array-like): Homogenised time coordinate to which all records are aligned. Required.
- title (str, optional): Title used for the plot if plot_output is True. Default is ''.
- print_output (bool, optional): If True, prints debug information about array sizes and resolutions. Default is False.
- plot_output (bool, optional): If True, generates diagnostic plots comparing the homogenised and original paleoData values and z-scores. Default is True.

Returns:

- paleoData_values_hom (numpy.ma.MaskedArray): Masked array of shape (n_records, n_years) containing homogenised paleoData values. Missing values are masked.
- paleoData_zscores_hom (numpy.ma.MaskedArray): Masked array of shape (n_records, n_years) containing homogenised paleoData z-scores. Missing values are masked.
- year_hom_avbl (list of list): One list per record containing the homogenised years at which that record has data.
- zsco_hom_avbl (list of list): One list per record containing the binned z-score values at those years.

Source code in dod2k_utilities/ut_analysis.py
def homogenise_data_dimensions(df, years_hom, title='', print_output=False, plot_output=True):
    """
    Homogenise the data arrays to a uniform time coordinate.

    This function assigns paleoData values and z-scores from each record in the
    DataFrame to a homogenised time axis (`years_hom`). Steps with no data are
    set to zero and masked using numpy masked arrays. Optional printing and
    plotting of intermediate checks are provided.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing the records with columns:
        - 'year': array-like of years
        - 'paleoData_values': array-like of values
        - 'paleoData_zscores': array-like of z-scores
    years_hom : array-like
        Homogenised time coordinate to which all records are aligned.
    title : str, optional
        Title used for the plot if `plot_output` is True.
    print_output : bool, optional
        If True, prints debug information about array sizes and resolutions.
        Default is False.
    plot_output : bool, optional
        If True, generates diagnostic plots showing the homogenised and original
        paleoData values and z-scores. Default is True.

    Returns
    -------
    paleoData_values_hom : numpy.ma.MaskedArray
        Masked array of shape (n_records, n_years) containing homogenised
        paleoData values. Missing values are masked.
    paleoData_zscores_hom : numpy.ma.MaskedArray
        Masked array of shape (n_records, n_years) containing homogenised
        paleoData z-scores. Missing values are masked.
    year_hom_avbl : list of list
        One list per record containing the homogenised years at which that
        record has data.
    zsco_hom_avbl : list of list
        One list per record containing the binned z-score values at those
        years.
    """

    mny = years_hom[0]
    mxy = years_hom[-1]
    minres = np.unique(np.diff(years_hom))[0]

    n_recs = len(df) 

    # assign data values
    paleoData_values_hom  = np.ma.masked_array(np.zeros([n_recs, len(years_hom)]), np.zeros([n_recs, len(years_hom)]))
    paleoData_zscores_hom = np.ma.masked_array(np.zeros([n_recs, len(years_hom)]), np.zeros([n_recs, len(years_hom)]))

    year_hom_avbl = []
    zsco_hom_avbl = []

    for ijk, ii in enumerate(df.index):
        # create empty data arrays 

        time = df.at[ii, 'year']


        data_LR = np.zeros(len(years_hom))
        data_HR = df.at[ii, 'paleoData_values']

        zsco_HR = df.at[ii, 'paleoData_zscores']
        zsco_LR = np.zeros(len(years_hom))

        tt = []
        zz = []
        for jj, xi in enumerate(years_hom):
            window = (time>xi-minres)&(time<=xi)
            if len(time[window]) == 0:
                data_LR[jj] = 0  # no data in this window; masked below
                zsco_LR[jj] = 0
            else:
                data_LR[jj] = np.average(data_HR[window])
                zsco_LR[jj] = np.average(zsco_HR[window])
                tt+=[xi]
                zz+=[np.average(zsco_HR[window])]


        yh_base = np.ma.masked_array(data_LR, mask=data_LR==0, fill_value=0)
        zh_base = np.ma.masked_array(zsco_LR, mask=zsco_LR==0, fill_value=0)

        # check array dimensions are correct
        if print_output:
            print(ii, ijk, 'years_hom size: ', years_hom.shape, 'new array size: ',
                  yh_base[~yh_base.mask].shape, 'resolution: ', np.unique(np.diff(time)))
            print(paleoData_values_hom[ijk,:].shape, yh_base.shape)  

        paleoData_values_hom[ijk,:]  = yh_base
        paleoData_zscores_hom[ijk,:] = zh_base
        year_hom_avbl.append(tt)
        zsco_hom_avbl.append(zz)

    print(paleoData_values_hom.shape)

    if plot_output:
        n_recs=min(len(df), 50)
        # plot paleoData_values_hom and paleoData_zscores_hom as they appear in df
        fig = plt.figure(figsize=(8,5))
        plt.suptitle(title)
        plt.subplot(221)
        plt.title('paleoData_values HOM')
        for ii in range(n_recs):
            shift = ii
            plt.plot(years_hom, paleoData_values_hom[ii,:]+shift, lw=1)
        plt.xlim(mny, mxy)

        plt.subplot(222)
        plt.title('paleoData_values')
        for ii in range(n_recs):
            shift = ii
            plt.plot(df.year.iloc[ii], df.paleoData_values.iloc[ii]+shift, lw=1)
        plt.xlim(mny, mxy)

        plt.subplot(223)
        plt.title('paleoData_zscores HOM')
        for ii in range(n_recs):
            shift = ii
            plt.plot(years_hom, paleoData_zscores_hom[ii,:]+shift , lw=1)
        plt.xlim(mny, mxy)

        plt.subplot(224)
        plt.title('paleoData_zscores')
        for ii in range(n_recs):
            shift = ii
            plt.plot(df.year.iloc[ii], df.paleoData_zscores.iloc[ii]+shift, lw=1)
        plt.xlim(mny, mxy)
        fig.tight_layout()
    return paleoData_values_hom, paleoData_zscores_hom, year_hom_avbl, zsco_hom_avbl  
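The binning step inside the loop (averaging all samples with xi - minres < t <= xi into each homogenised step, then masking empty steps) can be illustrated with a minimal standalone sketch using made-up values:

```python
import numpy as np

years_hom = np.arange(1900, 1921, 10)       # homogenised axis: 1900, 1910, 1920
time = np.array([1903.0, 1907.0, 1918.0])   # irregular record timeline
vals = np.array([1.0, 3.0, 5.0])
minres = 10

binned = np.zeros(len(years_hom))
for jj, xi in enumerate(years_hom):
    window = (time > xi - minres) & (time <= xi)
    binned[jj] = vals[window].mean() if window.any() else 0.0

# mask the empty (zero) bins, as homogenise_data_dimensions does
binned = np.ma.masked_array(binned, mask=binned == 0)
```

Note that masking on equality with zero would also mask a genuine zero-valued bin average; the module's function shares this behaviour.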

homogenise_time(df, mny, mxy, minres)

Homogenise the time coordinate of records in a DataFrame.

This function creates a uniform time axis from mny to mxy with steps of minres years and prints basic information about the homogenised timeline. It also calls find_shared_period to report the overlapping period across all records.

Parameters:

- df (pandas.DataFrame): DataFrame containing climate or archive records. Must be compatible with find_shared_period. Required.
- mny (int): Start year of the homogenisation period. Required.
- mxy (int): End year of the homogenisation period. Required.
- minres (int): Minimum resolution (in years) for the homogenised timeline. Required.

Returns:

- df (pandas.DataFrame): The input DataFrame (unchanged by this function).
- years_hom (numpy.ndarray): Array of homogenised years from mny to mxy (inclusive) with step minres.

Source code in dod2k_utilities/ut_analysis.py
def homogenise_time(df, mny, mxy, minres):
    """
    Homogenise the time coordinate of records in a DataFrame.

    This function creates a uniform time axis from `mny` to `mxy` with 
    steps of `minres` years and prints basic information about the 
    homogenised timeline. It also calls `find_shared_period` to report 
    the overlapping period across all records.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing climate or archive records. Must be compatible
        with `find_shared_period`.
    mny : int
        Start year of the homogenisation period.
    mxy : int
        End year of the homogenisation period.
    minres : int
        Minimum resolution (in years) for the homogenised timeline.

    Returns
    -------
    df : pandas.DataFrame
        The input DataFrame (unchanged in this function).
    years_hom : numpy.ndarray
        Array of homogenised years from `mny` to `mxy` with step `minres`.
    """

    years_hom = np.arange(mny, mxy + minres, minres)  # inclusive of mxy

    print('Homogenised time coordinate: %d-%d CE'%(years_hom[0], years_hom[-1]))
    print('Resolution: %s years'%str(np.unique(np.diff(years_hom))))

    find_shared_period(df, minmax=(mny, mxy))

    return df, years_hom
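The uniform axis construction is a single `np.arange` call whose upper bound is padded by one step so that `mxy` itself is included (a sketch with illustrative bounds):

```python
import numpy as np

mny, mxy, minres = 1000, 2000, 50
years_hom = np.arange(mny, mxy + minres, minres)  # endpoint mxy is included
# uniform resolution: every step is exactly minres years
```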

smooth(data, time, res)

Apply simple moving average smoothing to time series data.

Parameters:

- data (numpy.ndarray): Time series data array (1D or 2D). If 2D, smoothing is applied along the first dimension (rows). Required.
- time (numpy.ndarray): Corresponding time axis array. Should have the same length as the first dimension of data. Required.
- res (int): Window size for smoothing (number of points). The moving average uses non-overlapping windows. Required.

Returns:

- smooth_time (list): Smoothed time axis values, containing the mean time value for each window.
- smooth_data (list): Smoothed data values, containing the mean data value for each window.

Notes

This function uses a simple non-overlapping moving average with window size res. The output length will be approximately len(data) / res (rounded down).

Source code in dod2k_utilities/ut_analysis.py
def smooth(data, time, res):
    """
    Apply simple moving average smoothing to time series data.

    Parameters
    ----------
    data : numpy.ndarray
        Time series data array (1D or 2D). If 2D, smoothing is applied
        along the first dimension (rows).
    time : numpy.ndarray
        Corresponding time axis array. Should have same length as first
        dimension of data.
    res : int
        Window size for smoothing (number of points). The moving average
        uses non-overlapping windows.

    Returns
    -------
    smooth_time : list
        Smoothed time axis values, containing the mean time value for
        each window.
    smooth_data : list
        Smoothed data values, containing the mean data value for
        each window.

    Notes
    -----
    This function uses a simple non-overlapping moving average with
    window size `res`. The output length will be approximately
    len(data) / res (rounded down).
    """
    smooth_data = []
    smooth_time = []
    # step by `res` so the averaging windows do not overlap, as documented
    for ii in range(0, data.shape[0], res):
        smooth_data += [np.mean(data[ii:ii+res])]
        smooth_time += [np.mean(time[ii:ii+res])]
    return smooth_time, smooth_data
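The non-overlapping block average described in the Notes can be sketched directly (a standalone illustration with a hypothetical helper name, not a call into the module):

```python
import numpy as np

def block_average(data, time, res):
    # step in chunks of `res` points; each chunk contributes one mean
    idx = range(0, len(data), res)
    return ([float(np.mean(time[i:i + res])) for i in idx],
            [float(np.mean(data[i:i + res])) for i in idx])

t = np.arange(10.0)        # time axis 0..9
x = np.arange(10.0) * 2.0  # data 0..18
st, sx = block_average(x, t, res=5)
# two windows of five points each: len(st) == len(data) // res
```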