clophfit.fitting.utils
======================

.. py:module:: clophfit.fitting.utils

.. autoapi-nested-parse::

   Utility functions for fitting modules.


Functions
---------

.. autoapisummary::

   clophfit.fitting.utils.parse_remove_outliers
   clophfit.fitting.utils.identify_outliers_zscore
   clophfit.fitting.utils.reweight_from_residuals
   clophfit.fitting.utils.flag_trend_outliers
   clophfit.fitting.utils.fit_trendline
   clophfit.fitting.utils.smoothness
   clophfit.fitting.utils.roughness
   clophfit.fitting.utils.outlier_scores_extended
   clophfit.fitting.utils.apply_outlier_mask
   clophfit.fitting.utils.fit_rel_error_from_residuals
   clophfit.fitting.utils.fit_noise_model_nnls
   clophfit.fitting.utils.fit_noise_model_from_residuals
   clophfit.fitting.utils.fit_gain_and_rel_error_from_residuals
   clophfit.fitting.utils.compute_binding_slope
   clophfit.fitting.utils.compute_plate_slopes
   clophfit.fitting.utils.fit_ph_slope_noise


Module Contents
---------------

.. py:function:: parse_remove_outliers(spec)

   Parse outlier specification ``"method:threshold:min_keep"``.

   :param spec: The string to parse.
   :type spec: str

   :returns: A tuple of `method`, `threshold`, `min_keep`.
   :rtype: tuple[str, float, int]

   .. rubric:: Examples

   - ``"zscore:2.5:5"`` -> ("zscore", 2.5, 5)
   - ``"method"`` -> ("method", 2.0, 1)


.. py:function:: identify_outliers_zscore(residuals, threshold = 2.0)

   Identify outliers using the Z-score method on a 1D array of residuals.

   :param residuals: The residuals to analyze.
   :type residuals: np.ndarray
   :param threshold: The Z-score threshold beyond which a point is considered an outlier.
   :type threshold: float

   :returns: A boolean mask where True indicates an outlier.
   :rtype: ArrayMask


.. py:function:: reweight_from_residuals(ds, residuals)

   Update y_errc in a Dataset from the mean absolute residuals of each label.

   :param ds: The input dataset.
   :type ds: Dataset
   :param residuals: The combined 1D array of residuals for all labels in the dataset,
                     in the order of ds.values().
   :type residuals: np.ndarray

   :returns: A new dataset with updated y_err.
   :rtype: Dataset


.. py:function:: flag_trend_outliers(x, y, threshold = 3.0)

   Flag outliers using robust Theil-Sen regression of y on x.

   A point is flagged if its residual is far from the trendline (Z-score < -threshold)
   OR if its x-value is extremely low compared to the population (Z-score < -threshold).

   :param x: The independent variable (e.g., maximum signal, mean).
   :type x: pd.Series
   :param y: The dependent variable (e.g., signal span, std, or dynamic range).
   :type y: pd.Series
   :param threshold: The Z-score threshold for flagging an outlier.
   :type threshold: float

   :returns: A boolean Series of the same length as x, True for outliers.
   :rtype: pd.Series


.. py:function:: fit_trendline(x, y)

   Fit a robust Theil-Sen regression line.

   :param x: The independent variable.
   :type x: pd.Series
   :param y: The dependent variable.
   :type y: pd.Series

   :returns: Slope and intercept.
   :rtype: tuple[float, float]


.. py:function:: smoothness(y)

   Calculate the smoothness of a curve.

   Sum of \|consecutive diffs\| / total span.
   = 1 for perfectly monotone, > 1 for noisy/non-monotone.

   :param y: The signal array.
   :type y: np.ndarray

   :returns: The smoothness value.
   :rtype: float


.. py:function:: roughness(y)

   Calculate the roughness of a curve.

   Excess path fraction: 0 = perfectly monotone, 1 = all noise, flat-safe.
   roughness = (consec - span) / consec.

   :param y: The signal array.
   :type y: np.ndarray

   :returns: The roughness value.
   :rtype: float


.. py:function:: outlier_scores_extended(x, y)

   Compute outlier scores for each point using geometric deviation.

   Uses a hybrid approach for edge points:
   - If `edge_step` > 2 * `local_step`: anomalously large jump → use full projection deviation
   - Elif wrong direction (reversal): use projection deviation
   - Else (correct direction / plateau approach): score = 0

   For internal points: triangle inequality score.

   :param x: x-values (e.g. pH or concentration).
   :type x: np.ndarray
   :param y: Observed y-values.
   :type y: np.ndarray

   :returns: Per-point outlier scores (non-negative; higher = more anomalous).
   :rtype: np.ndarray

   .. rubric:: Examples

   >>> import numpy as np
   >>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
   >>> y = np.array([10.0, 8.0, 15.0, 4.0, 2.0])
   >>> scores = outlier_scores_extended(x, y)
   >>> bool(scores[2] > 0.4)
   True


.. py:function:: apply_outlier_mask(ds, threshold = 0.2, min_keep = 3)

   Mask outlier points iteratively in each DataArray of a Dataset.

   Removes the single worst outlier (if above threshold) and recomputes
   scores, repeating until no score exceeds the threshold or fewer than
   min_keep unmasked points remain.

   :param ds: Dataset to process (deep-copied; input is not modified).
   :type ds: Dataset
   :param threshold: Outlier score above which a point is masked. Default is 0.2.
   :type threshold: float, optional
   :param min_keep: Minimum number of unmasked points to retain. Default is 3.
   :type min_keep: int, optional

   :returns: A new Dataset with outlier points masked.
   :rtype: Dataset


.. py:function:: fit_rel_error_from_residuals(df, sigma_floor)

   Estimate proportional error (alpha) per label via moment estimator.

   Assumes the simplified noise model ``sigma^2 = floor^2 + alpha^2 * that^2``
   (no Poisson gain term). With ``floor`` known from buffer measurements
   and using model-predicted values ``that`` in the denominator to avoid
   noise-in-variables bias, the closed-form moment estimator is:

   .. math::

       \hat{\alpha}^2 =
       \frac{\overline{r^2} - \sigma_{\text{floor}}^2}{\overline{\hat{y}^2}}

   :param df: DataFrame with columns ``label`` (str), ``resid_raw`` (float), and
              ``predicted`` (float -- the model-predicted signal at each point).
              Typically from :func:`clophfit.fitting.residuals.collect_multi_residuals`.
   :type df: pd.DataFrame
   :param sigma_floor: Known read-noise floor per label, e.g. from ``tit.bg_noise``.
   :type sigma_floor: dict[str, float]

   :returns: Per-label proportional error estimate ``alpha`` (non-negative).
   :rtype: dict[str, float]

   .. rubric:: Examples

   >>> import numpy as np, pandas as pd
   >>> rng = np.random.default_rng(0)
   >>> y_pred = np.linspace(50, 500, 200)
   >>> floor, true_alpha = 5.0, 0.02
   >>> sigma = np.sqrt(floor**2 + (true_alpha * y_pred) ** 2)
   >>> resid = sigma * rng.standard_normal(200)
   >>> df = pd.DataFrame({"label": "1", "resid_raw": resid, "predicted": y_pred})
   >>> alpha = fit_rel_error_from_residuals(df, sigma_floor={"1": floor})
   >>> round(alpha["1"], 2)  # should be close to true_alpha=0.02
   0.02


.. py:function:: fit_noise_model_nnls(df, sigma_floor_fixed = None, rel_error_fixed = None)

   Fit heteroscedastic noise model via non-negative least squares.

   Model:  :math:`\sigma^2 = \sigma_\text{floor}^2 + \text{gain} \cdot y
   + \alpha^2 \cdot y^2`

   Uses :func:`scipy.optimize.nnls` to enforce non-negativity on all
   parameters, which stabilises estimates when :math:`y` and :math:`y^2`
   are highly collinear (typical for narrow-range titrations).

   :param df: Residual DataFrame with columns ``label``, ``resid_raw``, ``predicted``.
   :type df: pd.DataFrame
   :param sigma_floor_fixed: If given, fix floor per label and only fit gain and alpha.
   :type sigma_floor_fixed: dict[str, float] | None
   :param rel_error_fixed: If given, fix alpha per label and only fit floor and gain.
   :type rel_error_fixed: dict[str, float] | None

   :returns: ``(sigma_floor, gain, alpha)`` per label — all non-negative.
   :rtype: tuple[dict[str, float], dict[str, float], dict[str, float]]

   :raises ValueError: If both *sigma_floor_fixed* and *rel_error_fixed* are provided.


.. py:function:: fit_noise_model_from_residuals(df, rel_error = 0.003)

   Fit per-label noise model parameters from first-pass residuals.

   With ``rel_error`` fixed, the noise equation becomes linear in two unknowns
   via non-negative least squares.

   :param df: DataFrame with columns ``label``, ``resid_raw``, ``predicted``.
   :type df: pd.DataFrame
   :param rel_error: Fixed proportional error. A single float is broadcast to all labels.
                     Default is 0.003.
   :type rel_error: float | dict[str, float], optional

   :returns: ``(sigma_floor_dict, gain_dict)`` per label (non-negative).
   :rtype: tuple[dict[str, float], dict[str, float]]


.. py:function:: fit_gain_and_rel_error_from_residuals(df, sigma_floor)

   Fit gain and rel_error per label from residuals with known floor.

   Uses non-negative least squares on ``r^2 - floor^2 = gain * y + alpha^2 * y^2``
   to handle collinearity between :math:`y` and :math:`y^2`.

   :param df: DataFrame with columns ``label``, ``resid_raw``, ``predicted``.
   :type df: pd.DataFrame
   :param sigma_floor: Known noise floor per label.
   :type sigma_floor: dict[str, float]

   :returns: ``(gain_dict, rel_error_dict)`` per label (non-negative).
   :rtype: tuple[dict[str, float], dict[str, float]]


.. py:function:: compute_binding_slope(ph, pka, s0, s1)

   Compute |dS/dpH| for the Henderson-Hasselbalch equation.

   ``dS/dpH = (s1 - s0) * ln(10) * t / (1 + t)^2`` where ``t = 10^(pka - ph)``.
   Returns the absolute value (sign irrelevant for variance).


.. py:function:: compute_plate_slopes(results)

   Compute per-well per-label ``∂S/∂pH`` from pass-1 fit results.

   :param results: Fit results keyed by well (must have ``.result`` and ``.dataset``).
   :type results: dict[str, typing.Any]

   :returns: ``{well: {label: slope_array}}``.
   :rtype: dict[str, dict[str, np.ndarray]]


.. py:function:: fit_ph_slope_noise(df, noise_model, plate_slopes)

   Fit global ``sigma_ph`` from excess variance after per-label model.

   After subtracting the per-label noise model variance, the leftover
   ``r^2 - var_model`` is regressed against ``(dS/dpH)^2`` via NNLS.

   :param df: Residual DataFrame with columns ``label``, ``well``, ``resid_raw``,
              ``predicted``, and ``raw_i``.
   :type df: pd.DataFrame
   :param noise_model: Per-label noise model (floor, gain, alpha) fitted in the same pass.
   :type noise_model: PlateNoiseModel
   :param plate_slopes: Per-well per-label derivative ``|dS/dpH|`` arrays.
   :type plate_slopes: dict[str, dict[str, np.ndarray]]

   :returns: Global ``sigma_ph`` estimate (>= 0).
   :rtype: float