Machine Learning for Eigenanalysis

Let's start with the problem. Eigenanalysis of large systems demands substantial computational resources, and reducing the time needed to solve large eigenproblems has been the subject of numerous studies across science and engineering. Here, I employ different machine learning models to try to predict the eigenvalues of a given matrix.

The matrix I start with is a large, sparse 502x502 system of linear equations arising from the Jacobian of the solution to the Burgers problem. In this project, gathering data is one of the most difficult tasks (if not the most difficult), since every problem setup produces a different matrix. I have gathered as many matrices as possible to emulate a realistic machine learning pipeline.

To generate many matrices, the numerical mesh used to set up the Burgers equations is perturbed in a controlled way: vertex[15] is moved around, and the resulting Jacobian matrix is used for the eigenanalysis. The following figure shows the location of vertex[15] in relation to the adjacent cells, along with the coordinates of the neighboring vertices.

The numerical mesh

The next issue is that each example in the data contains 502x502 = 252004 matrix entries, i.e. about 252k potential features. Given the number of examples I have gathered, training an ML model on all of them would be impossible. As a result, I have used a problem-specific method to extract the most important features; this selection will differ from problem to problem, and for now I won't go into the details of the method.

Data Analysis

In this section, I will describe the data, the relevant features, and the target.

The data includes 38 features and a single target value, all of which are numeric. Here is a description of these values:

There are no missing values.

Some features are always zero or close to zero. We can safely remove these features as they won't affect our eventual ML model.

Here is the list of the dropped features.

As a first experiment, I will focus only on whether the eigenvalue target is positive or negative. In other words, I am turning the problem into a classification problem by making the target binary: I am removing the eigenvalue column and adding a target column.
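As a rough sketch of this step (assuming the data lives in a pandas DataFrame named df; the DataFrame name is my assumption, while eigenvalue and target are the column names mentioned above):

```python
import pandas as pd

# df is assumed to hold the 38 features plus the "eigenvalue" column
# (the real part of the least stable eigenvalue).
df["target"] = (df["eigenvalue"] > 0).astype(int)  # 1: positive, 0: negative
df = df.drop(columns=["eigenvalue"])
```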

Let's explore the data with a few experiments to familiarize ourselves with it.

As seen, there is good separation between the two target classes for most features. As a result, tree-based models might work well for our application.

As seen, there is good linear correlation between the target and many of the features.

Problem Definition

Let's first have a look at a contour plot that shows the relation between the two features vertex[15] x and vertex[15] y and the original target value (the real part of the least stable eigenvalue). As seen here, there is a clear boundary between the eigenvalues with positive and negative real parts. It seems we can use tree-based models for the classification problem at hand; KNNs and an SVM with an RBF kernel should also work.

Contour plot of eigenvalue vs mesh coordinates

As a result, let's only use the first two features vertex[15] x and vertex[15] y to predict the binary target.

Data Splitting and Target Extraction

Next, I will split the data into train and test sets. I use 20% of the examples for the test set, which is an acceptable size given how many examples we have in the data.
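A minimal sketch of the split (stratification and the random seed are my own choices, not necessarily those used originally):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

# Hold out 20% of the examples for testing; stratify to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```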

Define a preprocessor to apply a scaling transformation to the numeric_features.

Define a pipeline that applies the preprocessor followed by a DummyClassifier.
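A sketch of these two steps; the feature column names and the exact step layout are my assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

numeric_features = ["vertex[15] x", "vertex[15] y"]  # assumed column names

# Scale the numeric features; any other columns are dropped.
preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_features)]
)

# Baseline pipeline: preprocessing followed by a DummyClassifier.
dummy_pipe = make_pipeline(preprocessor, DummyClassifier())
```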

I am defining the following function for cross-validation purposes.
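The exact implementation is not shown here; a minimal sketch of such a helper, assuming it wraps scikit-learn's cross_validate and reports the mean and standard deviation of each score:

```python
import pandas as pd
from sklearn.model_selection import cross_validate

def mean_std_cross_val_scores(model, X, y, **kwargs):
    """Run cross-validation and return the mean and std of each score."""
    scores = cross_validate(model, X, y, **kwargs)
    return pd.DataFrame(scores).agg(["mean", "std"]).T

# Example usage with the baseline pipeline:
mean_std_cross_val_scores(dummy_pipe, X_train, y_train, scoring="roc_auc")
```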

I am using roc_auc as the scoring metric, which is a good choice for classification problems with class imbalance.

The area under the ROC curve (AUC) is 0.5, which corresponds to random guessing. This is the expected behavior from a DummyClassifier.

Next, let's try a LogisticRegression model. Here, I am carrying out hyperparameter optimization using RandomizedSearchCV. The optimization is performed for the hyperparameter C in LogisticRegression.

I am making a new pipeline with the optimized hyperparameter value for C in LogisticRegression.
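A sketch of the search and of rebuilding the pipeline with the best C (the search space, number of iterations, and solver settings are my own assumptions):

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

lr_pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# Search C on a log-uniform scale, scoring by roc_auc.
param_dist = {"logisticregression__C": loguniform(1e-3, 1e3)}
search = RandomizedSearchCV(
    lr_pipe, param_dist, n_iter=20, scoring="roc_auc", random_state=42
)
search.fit(X_train, y_train)

# Rebuild the pipeline with the optimized C.
best_C = search.best_params_["logisticregression__C"]
best_lr_pipe = make_pipeline(
    preprocessor, LogisticRegression(C=best_C, max_iter=1000)
)
```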

We are getting a roc_auc of 1.0, which is the highest possible score: all targets are predicted correctly in cross-validation.

I'll try out three more models: KNeighborsClassifier, SVC, and XGBClassifier. To deal with class imbalance, we set scale_pos_weight in XGBoost to the ratio of negative to positive examples, as suggested in the documentation, and we set class_weight='balanced' for the SVC.
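A sketch of these three pipelines (all other hyperparameters are left at their defaults):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Ratio of negative to positive training examples, used for scale_pos_weight.
neg_to_pos = (y_train == 0).sum() / (y_train == 1).sum()

models = {
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(class_weight="balanced"),
    "XGBoost": XGBClassifier(scale_pos_weight=neg_to_pos),
}

for name, model in models.items():
    pipe = make_pipeline(preprocessor, model)
    print(name)
    print(mean_std_cross_val_scores(pipe, X_train, y_train, scoring="roc_auc"))
```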

All of the new models also perform very well with the default hyperparameter values.

I am choosing the LogisticRegression model with the optimized hyperparameter C for final evaluation.

The scores look promising.

As seen in the confusion matrix for the trained pipeline (on the training set, of course), the numbers of false positives and false negatives are small compared to the numbers of true positives and true negatives.
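A sketch of this final evaluation step (fit on the training split, score on the held-out test split, and draw the confusion matrix):

```python
from sklearn.metrics import ConfusionMatrixDisplay, roc_auc_score

best_lr_pipe.fit(X_train, y_train)

# ROC AUC on the held-out test set.
test_auc = roc_auc_score(y_test, best_lr_pipe.predict_proba(X_test)[:, 1])
print(f"test roc_auc: {test_auc:.3f}")

# Confusion matrix of the trained pipeline on the training set.
ConfusionMatrixDisplay.from_estimator(best_lr_pipe, X_train, y_train)
```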

Defining a New Problem

In this section, I will change the problem to use the Jacobian matrix entries as features and will drop the vertex[15] location features.

Define a new column transformer to apply scaling to the new features.

For this part, I will use the LGBMClassifier. Let's build a pipeline and carry out the required computations.
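A sketch of this setup, assuming the Jacobian-entry features live in a DataFrame X_train_jac with columns named like jac[i][j] (both the variable name and the column naming are my assumptions):

```python
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train_jac is assumed to hold only the Jacobian-entry columns,
# with the vertex[15] coordinate features dropped.
jac_features = list(X_train_jac.columns)

jac_preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), jac_features)]
)

lgbm_pipe = make_pipeline(jac_preprocessor, LGBMClassifier(random_state=42))

# Cross-validate with the helper sketched earlier.
print(mean_std_cross_val_scores(lgbm_pipe, X_train_jac, y_train, scoring="roc_auc"))
```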

As before, we are getting astonishing results.

Study Feature Importance

An important remaining task is to study feature importance and figure out which features contribute most to our predictions. For LogisticRegression, we can simply look at the feature coefficients. However, for non-scikit-learn models such as LGBMClassifier, we need more sophisticated methods. Let's start with eli5, which we can use to explain the feature weights of our LGBMClassifier model.
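A sketch of the eli5 call (the step name comes from make_pipeline's default naming, and passing feature_names explicitly is my choice, since the model is trained on scaled arrays without column names):

```python
import eli5

lgbm_pipe.fit(X_train_jac, y_train)
lgbm_model = lgbm_pipe.named_steps["lgbmclassifier"]

# Global feature weights of the boosted trees.
eli5.show_weights(lgbm_model, feature_names=jac_features)
```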

Here, eli5 tells us that the feature with the highest impact on the predictions is jac[6][13], which is interesting.

Note that for a LightGBM binary classifier, the SHAP TreeExplainer output has changed to a list of ndarrays (one array per class).
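A sketch that computes the SHAP values and handles this list-style output (the step name again follows make_pipeline's defaults):

```python
import pandas as pd
import shap

# SHAP values are computed on the preprocessed (scaled) training features.
X_train_enc = pd.DataFrame(
    lgbm_pipe.named_steps["columntransformer"].transform(X_train_jac),
    columns=jac_features,
)

explainer = shap.TreeExplainer(lgbm_model)
shap_values = explainer.shap_values(X_train_enc)

# For a LightGBM binary classifier this can be a list of two arrays;
# keep the one for the positive class.
shap_values_pos = shap_values[1] if isinstance(shap_values, list) else shap_values
```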

SHAP also tells us that, on average, the most important feature for predicting the positive class is jac[6][13]. jac[15][15] also carries a large importance. We can think of these values as global feature importances.

This dependence plot tells us that high values of jac[6][13] tend to produce high SHAP values, which correspond to predicting the positive class. On the other hand, lower values of jac[6][13] give lower SHAP values, pushing the prediction toward the negative class.

The summary plot also tells us that higher values of jac[6][13] are likely to give the positive class; the same holds for jac[15][15].
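A sketch of the two plots discussed above, using the SHAP values computed earlier:

```python
import shap

# Dependence plot for the most important feature.
shap.dependence_plot("jac[6][13]", shap_values_pos, X_train_enc)

# Summary (beeswarm) plot: global importance and direction for each feature.
shap.summary_plot(shap_values_pos, X_train_enc)
```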

Now, let's try to explain the prediction for some of the examples. Here is the prediction probability for one example.

And here is the raw score of the model for this prediction.

As seen in this force plot, the raw score is less than the base value, which means we are predicting the negative class. This prediction is correct, as seen in the following.
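A sketch of how such a single example can be inspected (the example index i and the list-style expected_value indexing are assumptions):

```python
import shap

i = 0  # example index, chosen arbitrarily for illustration

# Predicted probability and raw (log-odds) score for this example.
print(lgbm_model.predict_proba(X_train_enc.iloc[[i]]))
print(lgbm_model.predict(X_train_enc.iloc[[i]], raw_score=True))

# Force plot: the raw score relative to the base value of the positive class.
shap.force_plot(
    explainer.expected_value[1],
    shap_values_pos[i, :],
    X_train_enc.iloc[i, :],
    matplotlib=True,
)

# Check the true label for this example.
print(y_train.iloc[i])
```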

Next example:

Here, the raw score is higher than the base value, which means we are predicting the positive class. This is again a correct prediction, as shown in the following.