pidp_tools.analysis module
This module contains all of the tools that are most likely to be useful for analysis.
- class pidp_tools.analysis.ConfusionMatrix(labels, predictions, title='', purity=False, label_selection='necessary')
Bases: object
Creates a confusion matrix based on a collection of labels and predictions.
Parameters
- labels : list
A list of strings or integers that represent the true particle type for a series of events.
- predictions : list
A list of strings or integers that represent the predicted particle type for a series of events.
- title : str, default “”
The title of the confusion matrix.
- purity : bool, default False
Normalize the confusion matrix by columns instead of rows. If True, the values in each column sum to 1; if False, the values in each row sum to 1.
- label_selection : {“all”, “charge”, “necessary”}, default “necessary”
The way to determine which columns and rows to include in the confusion matrix:
“all” : Includes all particle rows and columns, even if they are entirely empty.
“charge” : Includes all of the particle types included in the labels and predictions, plus all particles of the same charge category (charged, neutral) as those included in the labels and predictions.
“necessary” : Includes only those particles that are included in the labels or predictions.
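The effect of the purity flag can be illustrated without the library. The sketch below is not the ConfusionMatrix implementation; `normalized_confusion` is a hypothetical helper that only shows the difference between normalizing by rows (true labels) and by columns (predictions).

```python
from collections import Counter

def normalized_confusion(labels, predictions, purity=False):
    """Return {(true, predicted): fraction}, normalized by row or by column."""
    counts = Counter(zip(labels, predictions))
    # Row totals: how many events truly belong to each particle type.
    row_totals = Counter(labels)
    # Column totals: how many events were predicted as each particle type.
    col_totals = Counter(predictions)
    result = {}
    for (true, pred), n in counts.items():
        total = col_totals[pred] if purity else row_totals[true]
        result[(true, pred)] = n / total
    return result

labels = ["Proton", "Proton", "Pi+", "Pi+"]
predictions = ["Proton", "Pi+", "Pi+", "Pi+"]

by_row = normalized_confusion(labels, predictions)               # rows sum to 1
by_col = normalized_confusion(labels, predictions, purity=True)  # columns sum to 1
```

Here half of the true protons are identified correctly (row-normalized value 0.5), while every particle predicted to be a proton really is one (column-normalized value 1.0).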
- calculate_matrix(labels, predictions)
Calculates the confusion matrix based on a collection of labels and predictions.
Parameters
- labels : list
A list of integers that represent the true particle type for a series of events.
- predictions : list
A list of integers that represent the predicted particle type for a series of events.
- display_matrix(title)
Displays the confusion matrix.
Parameters
- title : str, default “”
The title of the confusion matrix.
- classmethod from_estimator(estimator, df, target='Generated As', title='', purity=False, label_selection='necessary')
Creates a confusion matrix based on the predictions made by the provided estimator.
Parameters
- estimator : function or method
The estimator to be used to identify particles. Estimators can take either rows of a dataframe and return a string (to be compatible with the .apply method of the dataframe object), or can take in an entire dataframe and return a series of strings.
- df : pandas.DataFrame
The dataframe whose rows represent particles that can be identified by the estimator. Supplied dataframes should have a “Hypothesis” column, which contains either a str or int, and a “Number of Hypotheses” column, which contains an int.
- target : str, default “Generated As”
The target of the estimator. The supplied dataframe must have a column with this label.
- title : str, default “”
The title of the confusion matrix.
- purity : bool, default False
Normalize the confusion matrix by columns instead of rows. If True, the values in each column sum to 1; if False, the values in each row sum to 1.
- label_selection : {“all”, “charge”, “necessary”}, default “necessary”
The way to determine which columns and rows to include in the confusion matrix:
“all” : Includes all particle rows and columns, even if they are entirely empty.
“charge” : Includes all of the particle types included in the labels and predictions, plus all particles of the same charge category (charged, neutral) as those included in the labels and predictions.
“necessary” : Includes only those particles that are included in the labels or predictions.
Returns
- classmethod from_model(model, df, target='Generated As', title='', purity=False, match_hypothesis=False, label_selection='charge')
Creates a confusion matrix based on the predictions made by the provided model.
Parameters
- model : Any scikit-learn trained model with “predict” and “predict_proba” methods.
The model to be used to predict the particle type of the particles supplied in the dataframe.
- df : pandas.DataFrame
The dataframe whose rows represent particles that can be identified by the model. Supplied dataframes should have a “Hypothesis” column, which contains either a str or int, and a “Number of Hypotheses” column, which contains an int.
- target : str, default “Generated As”
The target of the model. The supplied dataframe must have a column with this label.
- title : str, default “”
The title of the confusion matrix.
- purity : bool, default False
Normalize the confusion matrix by columns instead of rows. If True, the values in each column sum to 1; if False, the values in each row sum to 1.
- match_hypothesis : bool, default False
Require predictions to match the supplied hypothesis. If True, only considers predictions that match the hypothesis. Neutral particles, which have no hypothesis, are still considered in the typical sense. If False, the prediction of the model is the most frequent prediction among all hypotheses.
- label_selection : {“all”, “charge”, “necessary”}, default “charge”
The way to determine which columns and rows to include in the confusion matrix:
“all”: Includes all particle rows and columns, even if they are entirely empty.
“charge”: Includes all of the particle types included in the labels and predictions, plus all particles of the same charge category (charged, neutral) as those included in the labels and predictions.
“necessary”: Includes only those particles that are included in the labels or predictions.
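The match_hypothesis option describes how several per-hypothesis predictions for one particle are reduced to a single label. The following is a minimal sketch of one possible reading, not the library's code: `combine_predictions` is a hypothetical helper, and the fallback to "No ID" when no prediction matches its hypothesis is an assumption.

```python
from collections import Counter

def combine_predictions(hypotheses, predictions, match_hypothesis=False):
    """Reduce the per-hypothesis predictions for one particle to one label."""
    if match_hypothesis:
        # Keep only predictions that agree with the hypothesis they were made under.
        candidates = [p for h, p in zip(hypotheses, predictions) if h == p]
    else:
        candidates = list(predictions)
    if not candidates:
        # Assumption: report "No ID" when nothing survives the filter.
        return "No ID"
    # Most frequent label; ties go to the label seen first.
    return Counter(candidates).most_common(1)[0][0]

hypotheses = ["Proton", "K+", "Pi+"]
predictions = ["Pi+", "Pi+", "Proton"]

majority = combine_predictions(hypotheses, predictions)
matched = combine_predictions(hypotheses, predictions, match_hypothesis=True)
```

With these inputs the majority vote picks "Pi+", while requiring agreement with the hypothesis leaves no candidate, so the sketch falls back to "No ID".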
Returns
- pidp_tools.analysis.feature_importance(model, df, target='Generated As', match_hypothesis=False, n_repetitions=3, n_each=100)
Calculates and plots the permutation feature importances of the features supplied to the provided model.
Parameters
- model : Any scikit-learn trained model with “predict” and “predict_proba” methods.
The model to be used to predict the particle type of the particles supplied in the dataframe.
- df : pandas.DataFrame
The dataframe whose rows represent particles that can be identified by the model. Supplied dataframes should have a “Hypothesis” column, which contains either a str or int, and a “Number of Hypotheses” column, which contains an int.
- target : str, default “Generated As”
The target of the model. The supplied dataframe must have a column with this label.
- match_hypothesis : bool, default False
Require predictions to match the supplied hypothesis. If True, only considers predictions that match the hypothesis. Neutral particles, which have no hypothesis, are still considered in the typical sense. If False, the prediction of the model is the most frequent prediction among all hypotheses.
- n_repetitions : int, default 3
The number of times to permute each feature. The feature importance is the average accuracy over all of the repetitions.
- n_each : int, default 100
The number of events of each particle type to include in each permutation test.
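The idea behind a permutation test can be sketched in plain Python. This is an illustration under stated assumptions, not the feature_importance implementation: `accuracy`, `permuted_accuracy`, `toy_predict`, and the feature names "dEdx" and "noise" are all hypothetical. One feature column is shuffled, the accuracy is re-measured, and the result is averaged over n_repetitions shuffles.

```python
import random

def accuracy(predict, rows, targets):
    """Fraction of rows whose prediction equals the target."""
    hits = sum(predict(row) == t for row, t in zip(rows, targets))
    return hits / len(rows)

def permuted_accuracy(predict, rows, targets, column, n_repetitions=3, seed=0):
    """Average accuracy after shuffling one feature column n_repetitions times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repetitions):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Rebuild the rows with only the chosen column permuted.
        permuted = [dict(row, **{column: v}) for row, v in zip(rows, shuffled)]
        scores.append(accuracy(predict, permuted, targets))
    return sum(scores) / n_repetitions

# Toy "model" that uses only the hypothetical feature "dEdx" and ignores "noise".
def toy_predict(row):
    return "Proton" if row["dEdx"] > 0.5 else "Pi+"

rows = [{"dEdx": i / 10, "noise": (i * 7) % 10} for i in range(10)]
targets = [toy_predict(row) for row in rows]

baseline = accuracy(toy_predict, rows, targets)
noise_score = permuted_accuracy(toy_predict, rows, targets, "noise")
dedx_score = permuted_accuracy(toy_predict, rows, targets, "dEdx")
```

Shuffling a feature the model ignores leaves the accuracy at the baseline, while shuffling a feature the model relies on can only lower it; the size of that drop is what marks the feature as important.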
- pidp_tools.analysis.get_charge(ptype)
Returns the charge of the provided particle.
Parameters
- ptype : str or int
The particle to find the charge of. If int, the particle is assumed to be the element of the following list with the corresponding index: [“Photon”, “KLong”, “Neutron”, “Proton”, “K+”, “Pi+”, “AntiMuon”, “Positron”, “AntiProton”, “K-”, “Pi-”, “Muon”, “Electron”, “No ID”].
Examples
>>> get_charge("AntiProton")
-1
>>> get_charge(3)
1
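The integer-to-name convention described above can be reproduced with a plain lookup. The sketch below is not the library's code: `charge_of` is a hypothetical stand-in, the charges are the standard electric charges of the listed particles, and returning None for "No ID" is an assumption.

```python
# Particle order as documented for integer ptype values.
PARTICLES = ["Photon", "KLong", "Neutron", "Proton", "K+", "Pi+", "AntiMuon",
             "Positron", "AntiProton", "K-", "Pi-", "Muon", "Electron", "No ID"]

# Electric charge in units of the elementary charge.
CHARGES = {
    "Photon": 0, "KLong": 0, "Neutron": 0,
    "Proton": 1, "K+": 1, "Pi+": 1, "AntiMuon": 1, "Positron": 1,
    "AntiProton": -1, "K-": -1, "Pi-": -1, "Muon": -1, "Electron": -1,
}

def charge_of(ptype):
    """Look up the charge for a particle given by name or by list index."""
    name = PARTICLES[ptype] if isinstance(ptype, int) else ptype
    # Assumption: "No ID" has no defined charge, so None is returned for it.
    return CHARGES.get(name)
```

For instance, `charge_of(8)` resolves index 8 to "AntiProton" before looking up its charge.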
- pidp_tools.analysis.grab_events(input_df, n_each=5000, reverse=False, return_strings=False, allow_less=False)
Grabs the selected number of events for each particle type, preserving hypothesis groups.
Parameters
- input_df : pandas.DataFrame
The dataframe to grab events from. The supplied dataframe should have a “Number of Hypotheses” column.
- n_each : int, default 5000
The number of events of each particle type to include in the resulting dataset. The number of events for each particle type may be smaller if “allow_less” is True.
- reverse : bool, default False
If True, events are taken from the end of the dataframe first instead of from the beginning.
- return_strings : bool, default False
If True, the “Hypothesis” and “Generated As” columns of the returned dataframe contain strings instead of integers.
- allow_less : bool, default False
If True, the returned dataframe may contain fewer than the requested number of events for a particle type when not enough data is available, so the number of events may differ between particle types.
Returns
- smaller_dataset : pandas.DataFrame
A dataframe containing the events grabbed from the input dataframe.
- pidp_tools.analysis.install_ROOT()
Installs ROOT.
Examples
>>> from pidp_tools import *
>>> install_ROOT()
>>> from ROOT import *
- pidp_tools.analysis.load_model(path='my_model.joblib')
Loads a model from a joblib dump at the specified path.
Parameters
- path : str, default “my_model.joblib”
The path to the model save location.
Returns
- model : Scikit-learn trained model.
The loaded scikit-learn model.
- pidp_tools.analysis.round_accuracies(num)
Rounds a number to 2 decimal places. If the rounded number is 0.00 or 1.00, an int is returned.
Parameters
- num : float or int
The number to round.
Examples
>>> round_accuracies(0.3333333)
0.33
>>> round_accuracies(0.001)
0
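The documented behavior can be reproduced in a few lines. This is a sketch of the rule as described, not the library's source; `round_accuracy_sketch` is a hypothetical name.

```python
def round_accuracy_sketch(num):
    """Round to 2 decimal places; return an int when the result is 0 or 1."""
    rounded = round(num, 2)
    if rounded in (0, 1):
        # 0.0 and 1.0 are collapsed to the ints 0 and 1.
        return int(rounded)
    return rounded

third = round_accuracy_sketch(0.3333333)
tiny = round_accuracy_sketch(0.001)
almost_one = round_accuracy_sketch(0.999)
```

Returning an int for the two extreme values presumably keeps matrix cells compact ("0" and "1" rather than "0.0" and "1.0") when they are printed.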
- pidp_tools.analysis.save_model(model, path='my_model.joblib')
Saves a model as a joblib dump at the specified path.
Parameters
- model : Any scikit-learn trained model.
The model to be saved.
- path : str, default “my_model.joblib”
The path to the model save location.
- pidp_tools.analysis.split_df(input_df, training_fraction=0.9)
Splits the supplied dataframe into training data and test data, preserving hypothesis groups.
Parameters
- input_df : pandas.DataFrame
The dataframe to split. The supplied dataframe should have a “Number of Hypotheses” column.
- training_fraction : float, default 0.9
The fraction of events to be included in the training dataset. All remaining events will be included in the test dataset.
Returns
- training : pandas.DataFrame
A dataframe containing the requested fraction of the input data.
- test : pandas.DataFrame
A dataframe containing the rows of the input data not included in the training dataset.
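What "preserving hypothesis groups" could mean is sketched below on a list of dicts rather than a dataframe. This is not the split_df implementation. It assumes that the rows for one event are contiguous and that the “Number of Hypotheses” value on the first row of a group gives the number of rows in that group; `split_rows` and the "event" key are hypothetical.

```python
def split_rows(rows, training_fraction=0.9):
    """Split rows into (training, test) without cutting a hypothesis group."""
    # Walk the rows group by group, using the size stored on each group's first row.
    groups = []
    i = 0
    while i < len(rows):
        # Assumption: a value of 0 still occupies one row, so advance by at least 1.
        size = max(1, rows[i]["Number of Hypotheses"])
        groups.append(rows[i:i + size])
        i += size
    # Whole groups (events) go to training until the requested fraction is reached.
    n_training = int(len(groups) * training_fraction)
    training = [row for group in groups[:n_training] for row in group]
    test = [row for group in groups[n_training:] for row in group]
    return training, test

# Four events with 2, 1, 3, and 1 hypotheses respectively.
rows = []
for event, size in enumerate([2, 1, 3, 1]):
    rows.extend({"event": event, "Number of Hypotheses": size} for _ in range(size))

training, test = split_rows(rows, training_fraction=0.5)
```

Splitting by group rather than by row keeps every hypothesis for a given event on the same side of the split, so no event is shared between the training and test data.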