pidp_tools.analysis module

This module contains the tools most likely to be useful for analysis.

class pidp_tools.analysis.ConfusionMatrix(labels, predictions, title='', purity=False, label_selection='necessary')[source]

Bases: object

Creates a confusion matrix based on a collection of labels and predictions.

Parameters

labels : list

A list of strings or integers that represent the true particle type for a series of events.

predictions : list

A list of strings or integers that represent the predicted particle type for a series of events.

title : str, default “”

The title of the confusion matrix.

purity : bool, default False

Normalize the confusion matrix by columns instead of rows. If True, each column will sum to 1; if False, each row will sum to 1.

label_selection : {“all”, “charge”, “necessary”}, default “necessary”

The way to determine which columns and rows to include in the confusion matrix:

  • “all” : Includes all particle rows and columns, even if they are entirely empty.

  • “charge” : Includes all of the particle types included in the labels and predictions, plus all particles of the same charge category (charged, neutral) as those included in the labels and predictions.

  • “necessary” : Includes only those particles that are included in the labels or predictions.
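To illustrate the purity option, here is a minimal plain-Python sketch (not the library's implementation) of row versus column normalization of a matrix of raw counts:

```python
# Illustrative sketch of the `purity` option: normalize a confusion
# matrix of raw counts by rows (purity=False) or by columns (purity=True).
# This is not the library's implementation, just the idea.

def normalize(matrix, purity=False):
    if purity:
        # Each column sums to 1: of everything *predicted* as class j,
        # what fraction was truly class i?
        col_sums = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
        return [[row[j] / col_sums[j] if col_sums[j] else 0.0
                 for j in range(len(row))] for row in matrix]
    # Each row sums to 1: of everything *truly* class i,
    # what fraction was predicted as class j?
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in matrix]

counts = [[8, 2],   # true class 0: 8 predicted as 0, 2 predicted as 1
          [4, 6]]   # true class 1: 4 predicted as 0, 6 predicted as 1
print(normalize(counts))               # rows sum to 1
print(normalize(counts, purity=True))  # columns sum to 1
```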

calculate_matrix(labels, predictions)[source]

Calculates the confusion matrix based on a collection of labels and predictions.

Parameters

labels : list

A list of integers that represent the true particle type for a series of events.

predictions : list

A list of integers that represent the predicted particle type for a series of events.

display_matrix(title)[source]

Displays the confusion matrix.

Parameters

title : str, default “”

The title of the confusion matrix.

classmethod from_estimator(estimator, df, target='Generated As', title='', purity=False, label_selection='necessary')[source]

Creates a confusion matrix based on the predictions made by the provided estimator.

Parameters

estimator : function or method

The estimator used to identify particles. An estimator may either take a single dataframe row and return a string (compatible with the dataframe's .apply method), or take an entire dataframe and return a series of strings.

df : pandas.DataFrame

The dataframe whose rows represent particles that can be identified by the estimator. Supplied dataframes should have a “Hypothesis” column, which contains either a str or int, and a “Number of Hypotheses” column, which contains an int.

target : str, default “Generated As”

The target of the estimator. The supplied dataframe must have a column with this label.

title : str, default “”

The title of the confusion matrix.

purity : bool, default False

Normalize the confusion matrix by columns instead of rows. If True, each column will sum to 1; if False, each row will sum to 1.

label_selection : {“all”, “charge”, “necessary”}, default “necessary”

The way to determine which columns and rows to include in the confusion matrix:

  • “all” : Includes all particle rows and columns, even if they are entirely empty.

  • “charge” : Includes all of the particle types included in the labels and predictions, plus all particles of the same charge category (charged, neutral) as those included in the labels and predictions.

  • “necessary” : Includes only those particles that are included in the labels or predictions.

Returns

ConfusionMatrix
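For reference, a row-wise estimator of the first kind might look like the following sketch. The "Charge" feature name and the cut itself are hypothetical, chosen only to show the expected signature:

```python
# Sketch of a row-wise estimator: takes one dataframe row (mimicked here
# by a dict) and returns a particle-name string. The "Charge" column and
# the cut are hypothetical examples, not part of the library.

def my_estimator(row):
    # Trivial cut-based ID: positive charge -> "Proton", else "Electron".
    return "Proton" if row["Charge"] > 0 else "Electron"

# Mimics df.apply(my_estimator, axis=1) over a list of row-dicts:
rows = [{"Charge": 1}, {"Charge": -1}]
predictions = [my_estimator(r) for r in rows]
print(predictions)  # ['Proton', 'Electron']
```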

classmethod from_model(model, df, target='Generated As', title='', purity=False, match_hypothesis=False, label_selection='charge')[source]

Creates a confusion matrix based on the predictions made by the provided model.

Parameters

model : Any trained scikit-learn model with “predict” and “predict_proba” methods.

The model to be used to predict the particle type of the particles supplied in the dataframe.

df : pandas.DataFrame

The dataframe whose rows represent particles that can be identified by the model. Supplied dataframes should have a “Hypothesis” column, which contains either a str or int, and a “Number of Hypotheses” column, which contains an int.

target : str, default “Generated As”

The target of the model. The supplied dataframe must have a column with this label.

title : str, default “”

The title of the confusion matrix.

purity : bool, default False

Normalize the confusion matrix by columns instead of rows. If True, each column will sum to 1; if False, each row will sum to 1.

match_hypothesis : bool, default False

Require predictions to match the supplied hypothesis. If True, only considers predictions that match the hypothesis. Neutral particles, which have no hypothesis, are still considered in the typical sense. If False, the prediction of the model is the most frequent prediction among all hypotheses.

label_selection : {“all”, “charge”, “necessary”}, default “charge”

The way to determine which columns and rows to include in the confusion matrix:

  • “all”: Includes all particle rows and columns, even if they are entirely empty.

  • “charge”: Includes all of the particle types included in the labels and predictions, plus all particles of the same charge category (charged, neutral) as those included in the labels and predictions.

  • “necessary”: Includes only those particles that are included in the labels or predictions.

Returns

ConfusionMatrix
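The `match_hypothesis=False` behavior described above (the model's prediction is the most frequent prediction among all hypotheses for an event) can be sketched with a simple majority vote. Tie-breaking by first occurrence is an assumption here; the real implementation may differ:

```python
from collections import Counter

# Sketch of majority voting across hypotheses for one event
# (match_hypothesis=False). Not the library's implementation.
def vote(predictions_per_hypothesis):
    # Counter.most_common breaks ties by insertion order.
    return Counter(predictions_per_hypothesis).most_common(1)[0][0]

print(vote(["Pi+", "Proton", "Pi+"]))  # 'Pi+'
```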

pidp_tools.analysis.feature_importance(model, df, target='Generated As', match_hypothesis=False, n_repetitions=3, n_each=100)[source]

Calculates and plots the permutation feature importances of the features supplied to the provided model.

Parameters

model : Any trained scikit-learn model with “predict” and “predict_proba” methods.

The model to be used to predict the particle type of the particles supplied in the dataframe.

df : pandas.DataFrame

The dataframe whose rows represent particles that can be identified by the model. Supplied dataframes should have a “Hypothesis” column, which contains either a str or int, and a “Number of Hypotheses” column, which contains an int.

target : str, default “Generated As”

The target of the model. The supplied dataframe must have a column with this label.

match_hypothesis : bool, default False

Require predictions to match the supplied hypothesis. If True, only considers predictions that match the hypothesis. Neutral particles, which have no hypothesis, are still considered in the typical sense. If False, the prediction of the model is the most frequent prediction among all hypotheses.

n_repetitions : int, default 3

The number of times to permute each feature. The feature importance is the average accuracy over all of the repetitions.

n_each : int, default 100

The number of events of each particle type to include in each permutation test.
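The permutation idea behind this function can be sketched in plain Python: shuffle one feature column, re-measure accuracy, and report the average drop over several repetitions. This is an illustrative stand-in with a toy threshold "model", not the library's routine:

```python
import random

# Sketch of permutation feature importance. The toy "model" below uses
# only feature 0, so permuting feature 1 should cost no accuracy.
def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, n_repetitions=3, seed=0):
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    drops = []
    for _ in range(n_repetitions):
        column = [x[feature] for x in X]
        rng.shuffle(column)                      # permute one feature
        X_perm = [list(x) for x in X]
        for row, v in zip(X_perm, column):
            row[feature] = v
        drops.append(base - accuracy(model, X_perm, y))
    return sum(drops) / n_repetitions            # average accuracy drop

model = lambda x: int(x[0] > 0.5)                # uses only feature 0
X = [[0.1, 9], [0.9, 1], [0.2, 5], [0.8, 3]]
y = [0, 1, 0, 1]
print(permutation_importance(model, X, y, feature=1))  # 0.0: feature 1 unused
```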

pidp_tools.analysis.get_charge(ptype)[source]

Returns the charge of the provided particle.

Parameters

ptype : str or int

The particle to find the charge of. If int, the particle is assumed to be the element of the following list with the corresponding index: [“Photon”, “KLong”, “Neutron”, “Proton”, “K+”, “Pi+”, “AntiMuon”, “Positron”, “AntiProton”, “K-”, “Pi-”, “Muon”, “Electron”, “No ID”].

Examples

>>> get_charge("AntiProton")
-1
>>> get_charge(3)
1
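The lookup can be pictured as an index-aligned charge table over the particle list documented above. The charge values below are inferred from the standard charges of these particles, and treating "No ID" as 0 is an assumption:

```python
# Sketch of the integer-to-particle mapping and a charge lookup, based on
# the documented particle list. The charge table is inferred from standard
# particle charges; treating "No ID" as charge 0 is an assumption.
PARTICLES = ["Photon", "KLong", "Neutron", "Proton", "K+", "Pi+", "AntiMuon",
             "Positron", "AntiProton", "K-", "Pi-", "Muon", "Electron", "No ID"]
CHARGES = [0, 0, 0, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 0]

def charge_of(ptype):
    if isinstance(ptype, int):
        ptype = PARTICLES[ptype]       # int input indexes the particle list
    return CHARGES[PARTICLES.index(ptype)]

print(charge_of("AntiProton"))  # -1
print(charge_of(3))             # 1  (index 3 is "Proton")
```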
pidp_tools.analysis.grab_events(input_df, n_each=5000, reverse=False, return_strings=False, allow_less=False)[source]

Grabs the selected number of events for each particle type, preserving hypothesis groups.

Parameters

input_df : pandas.DataFrame

The dataframe to grab events from. The supplied dataframe should have a “Number of Hypotheses” column.

n_each : int, default 5000

The number of events of each particle type to include in the resulting dataset. The number of events for each particle type may be smaller if “allow_less” is True.

reverse : bool, default False

Grab events from the end of the dataframe first. If True, events are grabbed starting from the end of the dataframe instead of the beginning.

return_strings : bool, default False

If True, the “Hypothesis” and “Generated As” columns of the returned dataframe will contain strings instead of integers.

allow_less : bool, default False

Allow the final dataframe to have fewer than the requested number of events if not enough data is available. If True, the resulting dataframe may not have the requested number of events for each particle, and the number of events may be different for each particle type.

Returns

smaller_dataset : pandas.DataFrame

A dataframe containing the events grabbed from the input dataframe.
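"Preserving hypothesis groups" means the rows belonging to one event (one row per hypothesis) are kept or dropped together. The sketch below assumes, for illustration only, that the "Number of Hypotheses" value tells how many consecutive rows form one event; the library's actual grouping scheme may differ:

```python
# Sketch: group consecutive rows into events using "Number of Hypotheses",
# then keep whole events until each particle type's quota is reached.
# The row layout (one block of k consecutive rows per event) is an assumption.
def take_events(rows, n_each):
    taken, counts, i = [], {}, 0
    while i < len(rows):
        k = rows[i]["Number of Hypotheses"]
        group = rows[i:i + k]
        ptype = group[0]["Generated As"]
        if counts.get(ptype, 0) < n_each:
            taken.extend(group)        # keep the whole group or none of it
            counts[ptype] = counts.get(ptype, 0) + 1
        i += k
    return taken

rows = [
    {"Number of Hypotheses": 2, "Generated As": "Proton"},
    {"Number of Hypotheses": 2, "Generated As": "Proton"},
    {"Number of Hypotheses": 1, "Generated As": "Photon"},
    {"Number of Hypotheses": 2, "Generated As": "Proton"},
    {"Number of Hypotheses": 2, "Generated As": "Proton"},
]
print(len(take_events(rows, n_each=1)))  # 3: one 2-row Proton event + 1 Photon
```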

pidp_tools.analysis.install_ROOT()[source]

Installs ROOT.

Examples

>>> from pidp_tools import *
>>> install_ROOT()
>>> from ROOT import *
pidp_tools.analysis.load_model(path='my_model.joblib')[source]

Loads a model from a joblib dump at the specified path.

Parameters

path : str, default “my_model.joblib”

The path to the model save location.

Returns

model : Trained scikit-learn model.

The loaded scikit-learn model.

pidp_tools.analysis.round_accuracies(num)[source]

Rounds a number to 2 decimal places. If the rounded number is 0.00 or 1.00, an int is returned.

Parameters

num : float or int

The number to round.

Examples

>>> round_accuracies(0.3333333)
0.33
>>> round_accuracies(0.001)
0
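The documented behavior can be sketched as follows (an illustrative reimplementation, not the library source):

```python
# Sketch of the documented rounding behavior: round to 2 decimal places,
# returning an int when the rounded result is exactly 0.00 or 1.00.
def round_2(num):
    r = round(num, 2)
    return int(r) if r in (0.0, 1.0) else r

print(round_2(0.3333333))  # 0.33
print(round_2(0.001))      # 0
print(round_2(0.999))      # 1
```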
pidp_tools.analysis.save_model(model, path='my_model.joblib')[source]

Saves a model as a joblib dump at the specified path.

Parameters

model : Any trained scikit-learn model.

The model to be saved.

path : str, default “my_model.joblib”

The path to the model save location.
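save_model and load_model wrap a joblib dump/load round trip. The same pattern is sketched below with the standard-library pickle module (joblib's dump/load interface is analogous); the stand-in dict takes the place of a trained model:

```python
import os
import pickle
import tempfile

# Round-trip a stand-in model through a file, mirroring the
# save_model/load_model pattern; joblib.dump/joblib.load work analogously.
model = {"weights": [0.1, 0.2]}   # stand-in for a trained model
path = os.path.join(tempfile.gettempdir(), "my_model.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)         # cf. save_model(model, path)
with open(path, "rb") as f:
    loaded = pickle.load(f)       # cf. model = load_model(path)

print(loaded == model)  # True
```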

pidp_tools.analysis.split_df(input_df, training_fraction=0.9)[source]

Splits the supplied dataframe into training data and test data, preserving hypothesis groups.

Parameters

input_df : pandas.DataFrame

The dataframe to split. The supplied dataframe should have a “Number of Hypotheses” column.

training_fraction : float, default 0.9

The fraction of events to be included in the training dataset. All remaining events will be included in the test dataset.

Returns

training : pandas.DataFrame

A dataframe containing the requested fraction of the input data.

test : pandas.DataFrame

A dataframe containing the rows of the input data not included in the training dataset.
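As with grab_events, "preserving hypothesis groups" means whole events (one row per hypothesis) land entirely in either the training or the test set. The sketch below assumes, for illustration, that "Number of Hypotheses" gives the size of each consecutive block of rows forming one event:

```python
# Sketch of a group-preserving train/test split: whole events (blocks of
# "Number of Hypotheses" consecutive rows) go to one side or the other.
# The row layout is an assumption for illustration.
def split_rows(rows, training_fraction=0.9):
    events, i = [], 0
    while i < len(rows):
        k = rows[i]["Number of Hypotheses"]
        events.append(rows[i:i + k])   # one event = k consecutive rows
        i += k
    n_train = int(len(events) * training_fraction)
    train = [r for ev in events[:n_train] for r in ev]
    test = [r for ev in events[n_train:] for r in ev]
    return train, test

rows = [{"Number of Hypotheses": 2}] * 2 + [{"Number of Hypotheses": 1}] * 3
train, test = split_rows(rows, training_fraction=0.5)
print(len(train), len(test))  # 3 2
```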