clophfit.fitting.utils

Utility functions for fitting modules.

Functions

parse_remove_outliers(spec)

Parse outlier specification "method:threshold:min_keep".

identify_outliers_zscore(residuals[, threshold])

Identify outliers using the Z-score method on a 1D array of residuals.

reweight_from_residuals(ds, residuals)

Update y_errc in a Dataset from the mean absolute residuals of each label.

flag_trend_outliers(x, y[, threshold])

Flag outliers using robust Theil-Sen regression of y on x.

fit_trendline(x, y)

Fit a robust Theil-Sen regression line.

smoothness(y)

Calculate the smoothness of a curve.

roughness(y)

Calculate the roughness of a curve.

outlier_scores_extended(x, y)

Compute outlier scores for each point using geometric deviation.

apply_outlier_mask(ds[, threshold, min_keep])

Mask outlier points iteratively in each DataArray of a Dataset.

fit_rel_error_from_residuals(df, sigma_floor)

Estimate proportional error (alpha) per label via moment estimator.

fit_noise_model_nnls(df[, sigma_floor_fixed, ...])

Fit heteroscedastic noise model via non-negative least squares.

fit_noise_model_from_residuals(df[, rel_error])

Fit per-label noise model parameters from first-pass residuals.

fit_gain_and_rel_error_from_residuals(df, sigma_floor)

Fit gain and rel_error per label from residuals with known floor.

compute_binding_slope(ph, pka, s0, s1)

Compute |dS/dpH| for the Henderson-Hasselbalch equation.

compute_plate_slopes(results)

Compute per-well per-label ∂S/∂pH from pass-1 fit results.

fit_ph_slope_noise(df, noise_model, plate_slopes)

Fit global sigma_ph from excess variance after per-label model.

Module Contents

clophfit.fitting.utils.parse_remove_outliers(spec)

Parse outlier specification "method:threshold:min_keep".

Parameters:

spec (str) – The string to parse.

Returns:

A tuple of method, threshold, min_keep.

Return type:

tuple[str, float, int]

Examples

  • "zscore:2.5:5" -> ("zscore", 2.5, 5)

  • "method" -> ("method", 2.0, 1)
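
The parsing described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `parse_spec_sketch` is hypothetical, and the defaults (threshold 2.0, min_keep 1) are inferred from the second example.

```python
def parse_spec_sketch(spec: str) -> tuple[str, float, int]:
    """Split "method:threshold:min_keep", filling defaults for missing fields."""
    parts = spec.split(":")
    method = parts[0]
    # Assumed defaults, inferred from the "method" -> ("method", 2.0, 1) example.
    threshold = float(parts[1]) if len(parts) > 1 and parts[1] else 2.0
    min_keep = int(parts[2]) if len(parts) > 2 and parts[2] else 1
    return method, threshold, min_keep


print(parse_spec_sketch("zscore:2.5:5"))
print(parse_spec_sketch("method"))
```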

clophfit.fitting.utils.identify_outliers_zscore(residuals, threshold=2.0)

Identify outliers using the Z-score method on a 1D array of residuals.

Parameters:
  • residuals (np.ndarray) – The residuals to analyze.

  • threshold (float) – The Z-score threshold beyond which a point is considered an outlier.

Returns:

A boolean mask where True indicates an outlier.

Return type:

ArrayMask
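
A minimal sketch of Z-score flagging is shown below. It assumes the Z-score is computed from the mean and population standard deviation of the residuals; the library may use a different (for example, more robust) scale estimate. The name `zscore_outliers_sketch` is hypothetical, and plain lists stand in for NumPy arrays.

```python
from statistics import fmean, pstdev


def zscore_outliers_sketch(residuals, threshold=2.0):
    """Return a list of booleans, True where |z| exceeds the threshold."""
    mu = fmean(residuals)
    sd = pstdev(residuals)
    if sd == 0:
        # All residuals identical: nothing can be flagged.
        return [False] * len(residuals)
    return [abs(r - mu) / sd > threshold for r in residuals]


mask = zscore_outliers_sketch([0.1, -0.2, 0.0, 0.15, -0.1, 5.0])
print(mask)
```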

clophfit.fitting.utils.reweight_from_residuals(ds, residuals)

Update y_errc in a Dataset from the mean absolute residuals of each label.

Parameters:
  • ds (Dataset) – The input dataset.

  • residuals (np.ndarray) – The combined 1D array of residuals for all labels in the dataset, in the order of ds.values().

Returns:

A new dataset with updated y_err.

Return type:

Dataset
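
The per-label statistic can be illustrated with plain Python containers. This sketch assumes the combined residual array is the concatenation of each label's residuals in dataset order; the helper `mean_abs_residual_per_label` and the `lengths` mapping are hypothetical stand-ins for the Dataset object.

```python
from statistics import fmean


def mean_abs_residual_per_label(lengths, residuals):
    """Split a flat residual list by label and average the absolute values.

    lengths: dict mapping label -> number of points, in dataset order.
    """
    out = {}
    start = 0
    for label, n in lengths.items():
        chunk = residuals[start:start + n]
        out[label] = fmean([abs(r) for r in chunk])
        start += n
    return out


print(mean_abs_residual_per_label({"1": 3, "2": 2}, [1.0, -2.0, 3.0, -0.5, 0.5]))
```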

clophfit.fitting.utils.flag_trend_outliers(x, y, threshold=3.0)

Flag outliers using robust Theil-Sen regression of y on x.

A point is flagged if its residual lies far below the trendline (residual Z-score < -threshold) or if its x-value is extremely low compared with the population (x Z-score < -threshold).

Parameters:
  • x (pd.Series) – The independent variable (e.g., maximum signal, mean).

  • y (pd.Series) – The dependent variable (e.g., signal span, std, or dynamic range).

  • threshold (float) – The Z-score threshold for flagging an outlier.

Returns:

A boolean Series of the same length as x, True for outliers.

Return type:

pd.Series

clophfit.fitting.utils.fit_trendline(x, y)

Fit a robust Theil-Sen regression line.

Parameters:
  • x (pd.Series) – The independent variable.

  • y (pd.Series) – The dependent variable.

Returns:

Slope and intercept.

Return type:

tuple[float, float]
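
For readers unfamiliar with the Theil-Sen estimator, a minimal pure-Python sketch follows. The slope is the median of all pairwise slopes; the intercept here is taken as the median of y - slope * x, which is one common convention and may differ from what the library uses. The name `theil_sen_sketch` is hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen_sketch(x, y):
    """Return (slope, intercept) using the median of pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]  # skip vertical pairs
    ]
    slope = median(slopes)
    # Assumed intercept convention: median of the slope-corrected y-values.
    intercept = median([yi - slope * xi for xi, yi in zip(x, y)])
    return slope, intercept


# y = 2x + 1 with one gross outlier at the last point.
print(theil_sen_sketch([0, 1, 2, 3, 4], [1, 3, 5, 7, 100]))
```

Because medians are used throughout, the single outlier in the usage line does not pull the fitted line away from y = 2x + 1.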

clophfit.fitting.utils.smoothness(y)

Calculate the smoothness of a curve.

Defined as the sum of absolute consecutive differences divided by the total span. The value is 1 for a perfectly monotone curve and greater than 1 for a noisy or non-monotone one.

Parameters:

y (np.ndarray) – The signal array.

Returns:

The smoothness value.

Return type:

float

clophfit.fitting.utils.roughness(y)

Calculate the roughness of a curve.

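Defined as the excess path fraction, roughness = (consec - span) / consec, where consec is the sum of absolute consecutive differences and span is the total span. The value is 0 for a perfectly monotone curve and 1 for pure noise; the definition is safe for flat curves.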

Parameters:

y (np.ndarray) – The signal array.

Returns:

The roughness value.

Return type:

float
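
The smoothness and roughness definitions can be compared side by side in a short sketch. It assumes "total span" means max(y) - min(y) and that a flat curve gets roughness 0; both are assumptions, and the function names are hypothetical.

```python
def _path_and_span(y):
    """Return (sum of |consecutive diffs|, total span) for a sequence."""
    consec = sum(abs(b - a) for a, b in zip(y, y[1:]))
    span = max(y) - min(y)  # assumed definition of "total span"
    return consec, span


def smoothness_sketch(y):
    consec, span = _path_and_span(y)
    # Undefined for a flat curve; NaN is an arbitrary choice for this sketch.
    return consec / span if span > 0 else float("nan")


def roughness_sketch(y):
    consec, span = _path_and_span(y)
    # Flat-safe: no path length means no excess path.
    return (consec - span) / consec if consec > 0 else 0.0


print(smoothness_sketch([0, 1, 2, 3]), roughness_sketch([0, 1, 2, 3]))
print(smoothness_sketch([0, 2, 1, 3]), roughness_sketch([0, 2, 1, 3]))
```

Under these assumptions the two quantities are related by roughness = 1 - 1 / smoothness whenever the curve is not flat.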

clophfit.fitting.utils.outlier_scores_extended(x, y)

Compute outlier scores for each point using geometric deviation.

Uses a hybrid approach for edge points:

  • If edge_step > 2 * local_step (anomalously large jump): use the full projection deviation.

  • Else if the step goes in the wrong direction (reversal): use the projection deviation.

  • Otherwise (correct direction or plateau approach): score = 0.

For internal points: triangle inequality score.

Parameters:
  • x (np.ndarray) – x-values (e.g. pH or concentration).

  • y (np.ndarray) – Observed y-values.

Returns:

Per-point outlier scores (non-negative; higher = more anomalous).

Return type:

np.ndarray

Examples

>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
>>> y = np.array([10.0, 8.0, 15.0, 4.0, 2.0])
>>> scores = outlier_scores_extended(x, y)
>>> bool(scores[2] > 0.4)
True

clophfit.fitting.utils.apply_outlier_mask(ds, threshold=0.2, min_keep=3)

Mask outlier points iteratively in each DataArray of a Dataset.

Removes the single worst outlier (if above threshold) and recomputes scores, repeating until no score exceeds the threshold or fewer than min_keep unmasked points remain.

Parameters:
  • ds (Dataset) – Dataset to process (deep-copied; input is not modified).

  • threshold (float, optional) – Outlier score above which a point is masked. Default is 0.2.

  • min_keep (int, optional) – Minimum number of unmasked points to retain. Default is 3.

Returns:

A new Dataset with outlier points masked.

Return type:

Dataset
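
The remove-one-and-rescore loop can be sketched independently of the Dataset type. In this illustration the scoring function is a simple stand-in (deviation of each interior point from the midpoint of its neighbours, scaled by the data range), not the geometric score used by the library, and the loop stops once only min_keep points remain. All names are hypothetical.

```python
def neighbour_midpoint_scores(vals):
    """Stand-in score: |y_i - mean of neighbours| / range; edge points score 0."""
    spread = max(vals) - min(vals)
    scores = [0.0] * len(vals)
    if spread == 0:
        return scores
    for i in range(1, len(vals) - 1):
        scores[i] = abs(vals[i] - (vals[i - 1] + vals[i + 1]) / 2) / spread
    return scores


def iterative_mask_sketch(values, score_fn, threshold=0.2, min_keep=3):
    """Return a keep-mask (True = retained) after iterative outlier removal."""
    keep = [True] * len(values)
    while sum(keep) > min_keep:
        kept_idx = [i for i, k in enumerate(keep) if k]
        scores = score_fn([values[i] for i in kept_idx])
        worst = max(range(len(kept_idx)), key=scores.__getitem__)
        if scores[worst] <= threshold:
            break  # no remaining point exceeds the threshold
        keep[kept_idx[worst]] = False  # drop only the single worst point
    return keep


print(iterative_mask_sketch([10.0, 8.0, 15.0, 4.0, 2.0], neighbour_midpoint_scores))
```

Rescoring after every removal matters because one gross outlier inflates the scores of its neighbours; once it is masked, those neighbours may no longer exceed the threshold.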

clophfit.fitting.utils.fit_rel_error_from_residuals(df, sigma_floor)

Estimate proportional error (alpha) per label via moment estimator.

Assumes the simplified noise model \(\sigma^2 = \sigma_\text{floor}^2 + \alpha^2 \hat{y}^2\) (no Poisson gain term). With the floor known from buffer measurements, and using the model-predicted values \(\hat{y}\) in the denominator to avoid noise-in-variables bias, the closed-form moment estimator is:

\[\hat{\alpha}^2 = \frac{\overline{r^2} - \sigma_{\text{floor}}^2}{\overline{\hat{y}^2}}\]
Parameters:
  • df (pd.DataFrame) – DataFrame with columns label (str), resid_raw (float), and predicted (float – the model-predicted signal at each point). Typically from clophfit.fitting.residuals.collect_multi_residuals().

  • sigma_floor (dict[str, float]) – Known read-noise floor per label, e.g. from tit.bg_noise.

Returns:

Per-label proportional error estimate alpha (non-negative).

Return type:

dict[str, float]

Examples

>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> y_pred = np.linspace(50, 500, 200)
>>> floor, true_alpha = 5.0, 0.02
>>> sigma = np.sqrt(floor**2 + (true_alpha * y_pred) ** 2)
>>> resid = sigma * rng.standard_normal(200)
>>> df = pd.DataFrame({"label": "1", "resid_raw": resid, "predicted": y_pred})
>>> alpha = fit_rel_error_from_residuals(df, sigma_floor={"1": floor})
>>> round(alpha["1"], 2)  # should be close to true_alpha=0.02
0.02
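
The closed-form estimator itself is short enough to write out for a single label. The sketch below follows the formula above, using plain lists instead of a DataFrame; clipping a negative variance estimate to zero is an assumption made so that the result is non-negative, and the name `rel_error_moment_sketch` is hypothetical.

```python
from math import sqrt
from statistics import fmean


def rel_error_moment_sketch(resid, predicted, floor):
    """alpha = sqrt((mean(r^2) - floor^2) / mean(y_hat^2)), clipped at 0."""
    mean_r2 = fmean([r * r for r in resid])
    mean_y2 = fmean([p * p for p in predicted])
    alpha_sq = (mean_r2 - floor**2) / mean_y2
    return sqrt(max(alpha_sq, 0.0))  # assumed clipping for non-negativity


# Noise-free check: residuals set exactly to the model sigma.
y_hat = [100.0, 200.0, 300.0, 400.0]
resid = [sqrt(5.0**2 + (0.02 * p) ** 2) for p in y_hat]
print(rel_error_moment_sketch(resid, y_hat, floor=5.0))
```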

clophfit.fitting.utils.fit_noise_model_nnls(df, sigma_floor_fixed=None, rel_error_fixed=None)

Fit heteroscedastic noise model via non-negative least squares.

Model: \(\sigma^2 = \sigma_\text{floor}^2 + \text{gain} \cdot y + \alpha^2 \cdot y^2\)

Uses scipy.optimize.nnls() to enforce non-negativity on all parameters, which stabilises estimates when \(y\) and \(y^2\) are highly collinear (typical for narrow-range titrations).

Parameters:
  • df (pd.DataFrame) – Residual DataFrame with columns label, resid_raw, predicted.

  • sigma_floor_fixed (dict[str, float] | None) – If given, fix floor per label and only fit gain and alpha.

  • rel_error_fixed (dict[str, float] | None) – If given, fix alpha per label and only fit floor and gain.

Returns:

(sigma_floor, gain, alpha) per label — all non-negative.

Return type:

tuple[dict[str, float], dict[str, float], dict[str, float]]

Raises:

ValueError – If both sigma_floor_fixed and rel_error_fixed are provided.
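
As an illustration of why the model suits non-negative least squares, it can be written as a linear system. This is a plausible construction rather than a description of the actual implementation; it assumes the squared residuals \(r_i^2\) (column resid_raw) act as per-point variance estimates and the predicted values \(\hat{y}_i\) (column predicted) stand in for \(y\):

\[
r_i^2 \approx \sigma_\text{floor}^2 \cdot 1 + \text{gain} \cdot \hat{y}_i + \alpha^2 \cdot \hat{y}_i^2
\]

\[
\min_{\beta \ge 0} \lVert A\beta - b \rVert_2, \qquad
A = \begin{bmatrix} 1 & \hat{y}_1 & \hat{y}_1^2 \\ \vdots & \vdots & \vdots \\ 1 & \hat{y}_n & \hat{y}_n^2 \end{bmatrix}, \quad
\beta = \begin{bmatrix} \sigma_\text{floor}^2 \\ \text{gain} \\ \alpha^2 \end{bmatrix}, \quad
b = \begin{bmatrix} r_1^2 \\ \vdots \\ r_n^2 \end{bmatrix}
\]

When one parameter is fixed, its term moves to the right-hand side and its column is dropped from \(A\); for a fixed floor, for example, \(b_i = r_i^2 - \sigma_\text{floor}^2\) and only the gain and \(\alpha^2\) columns remain.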

clophfit.fitting.utils.fit_noise_model_from_residuals(df, rel_error=0.003)

Fit per-label noise model parameters from first-pass residuals.

With rel_error fixed, the noise equation becomes linear in the two remaining unknowns (the squared floor and the gain), which are estimated via non-negative least squares.

Parameters:
  • df (pd.DataFrame) – DataFrame with columns label, resid_raw, predicted.

  • rel_error (float | dict[str, float], optional) – Fixed proportional error. A single float is broadcast to all labels. Default is 0.003.

Returns:

(sigma_floor_dict, gain_dict) per label (non-negative).

Return type:

tuple[dict[str, float], dict[str, float]]

clophfit.fitting.utils.fit_gain_and_rel_error_from_residuals(df, sigma_floor)

Fit gain and rel_error per label from residuals with known floor.

Uses non-negative least squares on r^2 - floor^2 = gain * y + alpha^2 * y^2 to handle collinearity between \(y\) and \(y^2\).

Parameters:
  • df (pd.DataFrame) – DataFrame with columns label, resid_raw, predicted.

  • sigma_floor (dict[str, float]) – Known noise floor per label.

Returns:

(gain_dict, rel_error_dict) per label (non-negative).

Return type:

tuple[dict[str, float], dict[str, float]]

clophfit.fitting.utils.compute_binding_slope(ph, pka, s0, s1)

Compute |dS/dpH| for the Henderson-Hasselbalch equation.

|dS/dpH| = |s1 - s0| * ln(10) * t / (1 + t)^2, where t = 10^(pka - ph). The absolute value is returned because the sign is irrelevant for variance.

Parameters:
  • ph (numpy.ndarray)

  • pka (float)

  • s0 (float)

  • s1 (float)

Return type:

numpy.ndarray
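
The slope formula above translates directly into code. The sketch below uses a plain list of pH values instead of a NumPy array, and the name `binding_slope_sketch` is hypothetical.

```python
import math


def binding_slope_sketch(ph, pka, s0, s1):
    """Return |dS/dpH| at each pH value for a Henderson-Hasselbalch curve."""
    slopes = []
    for p in ph:
        t = 10.0 ** (pka - p)
        slopes.append(abs(s1 - s0) * math.log(10) * t / (1 + t) ** 2)
    return slopes


# The slope peaks at pH = pKa, where t = 1 and t / (1 + t)^2 = 1/4.
print(binding_slope_sketch([6.0, 7.0, 8.0], pka=7.0, s0=100.0, s1=500.0))
```

The factor t / (1 + t)^2 is symmetric under t -> 1/t, so the slope magnitude is the same at equal distances above and below the pKa.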

clophfit.fitting.utils.compute_plate_slopes(results)

Compute per-well per-label ∂S/∂pH from pass-1 fit results.

Parameters:

results (dict[str, Any]) – Fit results keyed by well (must have .result and .dataset).

Returns:

{well: {label: slope_array}}.

Return type:

dict[str, dict[str, np.ndarray]]

clophfit.fitting.utils.fit_ph_slope_noise(df, noise_model, plate_slopes)

Fit global sigma_ph from excess variance after per-label model.

After subtracting the per-label noise model variance, the leftover r^2 - var_model is regressed against (dS/dpH)^2 via NNLS.

Parameters:
  • df (pd.DataFrame) – Residual DataFrame with columns label, well, resid_raw, predicted, and raw_i.

  • noise_model (PlateNoiseModel) – Per-label noise model (floor, gain, alpha) fitted in the same pass.

  • plate_slopes (dict[str, dict[str, np.ndarray]]) – Per-well per-label derivative |dS/dpH| arrays.

Returns:

Global sigma_ph estimate (>= 0).

Return type:

float
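
If the regression described above has a single regressor and no intercept (an assumption; the actual design matrix is not documented here), the NNLS solution has a closed form: the ordinary least-squares coefficient clipped at zero. The sketch below uses flat lists for the residuals, model variances, and slopes; all names are hypothetical.

```python
from math import sqrt


def sigma_ph_sketch(resid, var_model, slopes):
    """Estimate sigma_ph from excess variance ~ sigma_ph^2 * (dS/dpH)^2."""
    excess = [r * r - v for r, v in zip(resid, var_model)]
    s2 = [s * s for s in slopes]
    num = sum(a * e for a, e in zip(s2, excess))
    den = sum(a * a for a in s2)
    if den == 0:
        return 0.0  # no slope information available
    # Single-column NNLS: least-squares coefficient clipped at zero.
    return sqrt(max(num / den, 0.0))


# Noise-free check with sigma_ph = 0.05 and a constant model variance of 4.
slopes = [10.0, 50.0, 100.0]
resid = [sqrt(4.0 + (0.05 * s) ** 2) for s in slopes]
print(sigma_ph_sketch(resid, [4.0, 4.0, 4.0], slopes))
```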