Summary

Project Architecture

MPGA Workflow Diagram

Project Overview

Our project follows a classic machine learning workflow. We start by collecting data from the Kepler, TESS, and K2 missions, then examine each feature closely, considering what it measures, what values it takes, and how useful it is likely to be for prediction.

Feature Engineering & Data Preprocessing

Removed Features

Limit Features

We took out all features related to "limits" because they were mostly empty or filled with zeros, so they didn't add any value.

Irrelevant Features

We also removed features that didn't help with exoplanet prediction, like IDs or discovery dates, and anything that would make the task too easy or unrealistic.

Example: "sy_pnum" in K2

This feature tells you how many planets have already been found around a star. It doesn't make sense to use it for prediction, because for a new system, we wouldn't know this number in advance. That's exactly what we're trying to figure out!

Preprocessing Pipeline

We built a scikit-learn pipeline to process the data. Here's what it does (a minimal sketch follows the steps below):

1. Columns Filtering: we remove columns that don't help or that would give away the answer.

2. Filling Missing Values: we fill in missing data using the best method for each column.

3. Encoding Categorical Variables: we turn text columns into numbers so the model can use them.

4. Standard Scaling Numerical Variables: we scale the numeric features to a similar range, which helps the model learn.
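Here is a minimal sketch of such a pipeline, just to illustrate the steps above; the num_cols and cat_cols lists are assumed to hold the remaining numeric and categorical feature names, and the imputation strategies are illustrative choices, not necessarily the ones used for every column.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Columns filtering happens before this point: leaky or unhelpful columns are dropped,
# leaving num_cols (numeric) and cat_cols (categorical) as the feature names.

numeric_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),   # fill missing numbers
    ("scale", StandardScaler()),                    # standard scaling
])

categorical_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),  # fill missing categories
    ("encode", OneHotEncoder(handle_unknown="ignore")),   # text -> numbers
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_steps, num_cols),
    ("cat", categorical_steps, cat_cols),
])

The fitted preprocessor can then be chained with any estimator inside a single Pipeline.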

Kepler Dataset Analysis

Dropped Columns

cols_to_drop = [
    "loc_rowid", "kepid", "kepoi_name", "kepler_name", 
    "koi_time0bk", "koi_time0bk_err1", "koi_time0bk_err2", 
    "koi_tce_plnt_num", "koi_tce_delivname", 
    "ra", "dec",
    "koi_pdisposition",  # Time-series-analysis cheating
    "koi_score"          # Score of cheat
]

All other features were retained for training.
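For illustration, loading the table and dropping these columns with pandas could look like the sketch below; the file name and the koi_disposition target column are assumptions, not taken from the text above.

import pandas as pd

# Load the Kepler KOI table (placeholder file name) and drop the identifier,
# positional, and leakage-prone columns listed above.
df = pd.read_csv("kepler_koi.csv")
X = df.drop(columns=cols_to_drop)
y = df["koi_disposition"]  # assumed label column (confirmed vs. false positive)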

Model Performance

We trained a Logistic Regression model with 5-fold cross-validation and tuned its hyperparameters for the best results. Here's how it performed:

  • Accuracy: 98.42%
  • Precision: 97.30%
  • Recall: 98.36%
  • F1 Score: 97.83%
  • ROC-AUC: 99.58%
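For reference, here is a minimal sketch of the kind of cross-validated tuning described above. It assumes the preprocessor from the preprocessing section and the X, y from the Kepler loading sketch; the hyperparameter grid is purely illustrative.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = Pipeline(steps=[
    ("prep", preprocessor),                        # imputation + encoding + scaling
    ("model", LogisticRegression(max_iter=5000)),
])

# 5-fold cross-validation over a small, illustrative hyperparameter grid.
grid = GridSearchCV(
    clf,
    param_grid={"model__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)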

Visualizations

ROC Curve & Confusion Matrix - Kepler

Feature Importance - Kepler

Prediction Visualization on the Two Most Important Features - Kepler

K2 Dataset Analysis

Used Features

features = [
    'disposition', 'sy_snum', 'pl_controv_flag', 'pl_orbper',
    'pl_orbpererr1', 'pl_orbpererr2', 'pl_orbsmax', 'pl_orbsmaxerr1',
    'pl_orbsmaxerr2', 'pl_rade', 'pl_radeerr1', 'pl_radeerr2', 'pl_radj',
    'pl_radjerr1', 'pl_radjerr2', 'pl_bmasse', 'pl_bmasseerr1',
    'pl_bmasseerr2', 'pl_bmassj', 'pl_bmassjerr1', 'pl_bmassjerr2',
    'pl_bmassprov', 'pl_orbeccen', 'pl_orbeccenerr1', 'pl_orbeccenerr2',
    'pl_insol', 'pl_insolerr1', 'pl_insolerr2', 'pl_eqt', 'pl_eqterr1',
    'pl_eqterr2', 'ttv_flag', 'st_spectype', 'st_teff', 'st_tefferr1',
    'st_tefferr2', 'st_rad', 'st_raderr1', 'st_raderr2', 'st_mass',
    'st_masserr1', 'st_masserr2', 'st_met', 'st_meterr1', 'st_meterr2',
    'st_metratio', 'st_logg', 'st_loggerr1', 'st_loggerr2', 'sy_dist',
    'sy_disterr1', 'sy_disterr2', 'sy_vmag', 'sy_vmagerr1', 'sy_vmagerr2',
    'sy_kmag', 'sy_kmagerr1', 'sy_kmagerr2', 'sy_gaiamag', 'sy_gaiamagerr1',
    'sy_gaiamagerr2'
]

Target Variable

disposition is the target variable.

Data Filtering Strategy

We left out the candidate rows, since we can't be sure if they're real exoplanets or not. Our goal is to train a model that can tell the difference reliably.
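A minimal sketch of this filtering, assuming the K2 table is loaded in a DataFrame named k2 and that disposition uses the archive's usual labels (CONFIRMED, CANDIDATE, FALSE POSITIVE):

# Keep only rows with a definitive label; CANDIDATE rows are excluded.
k2 = k2[k2["disposition"] != "CANDIDATE"]

# Binary target: 1 for confirmed planets, 0 for everything else (e.g., false positives).
y = (k2["disposition"] == "CONFIRMED").astype(int)
X = k2.drop(columns=["disposition"])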

Categorical Encoding

cat_cols_onehot = ["st_spectype", "st_metratio"]
cat_cols_label = ["pl_bmassprov"]
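One way to implement this split, shown as a sketch only: one-hot encoding for the spectral type and metallicity-ratio columns, and an ordinal (label-style) encoding for the mass-provenance column. The k2 DataFrame is assumed from the filtering sketch above.

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-hot encode the multi-valued categorical columns ...
onehot = OneHotEncoder(handle_unknown="ignore")
X_onehot = onehot.fit_transform(k2[cat_cols_onehot].astype(str))

# ... and integer-encode the mass-provenance column.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_label = ordinal.fit_transform(k2[cat_cols_label].astype(str))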

Model Performance

  • Accuracy: 96.2%
  • Precision: 99.13%
  • Recall: 96.6%
  • F1 Score: 97.8%

Visualizations

Confusion Matrix - K2

Feature Importance - K2

TESS Dataset Analysis

Most Challenging Dataset

The TESS dataset was the toughest to work with and needed more advanced modeling.

Model Evolution

Phase 1: Traditional ML Models

We started with some classic machine learning models:

  • Logistic Regression: 77.7% accuracy
  • Decision Tree: 81.16% accuracy

These results were okay, but we wanted to do better.
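As a sketch of how such a baseline comparison could be run, assuming preprocessed TESS features X and binary labels y; the tree depth is an illustrative choice.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Cross-validated accuracy for the two classical baselines.
for name, model in [
    ("Logistic Regression", LogisticRegression(max_iter=5000)),
    ("Decision Tree", DecisionTreeClassifier(max_depth=10, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")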

Phase 2: Autoencoder Architecture

Next, we tried an autoencoder with an extra output for the class prediction.

How it works

The model learns to compress the input into a lower-dimensional embedding, then reconstructs the input and predicts the class from that same embedding at the same time.
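A minimal PyTorch sketch of this idea; the layer sizes, latent dimension, and loss weighting below are illustrative assumptions, not the exact architecture we trained.

import torch.nn as nn
import torch.nn.functional as F

class SupervisedAutoencoder(nn.Module):
    """Autoencoder with an extra classification head on the bottleneck."""

    def __init__(self, n_features, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )
        self.classifier = nn.Sequential(nn.Linear(latent_dim, 1), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)                        # compressed embedding
        return self.decoder(z), self.classifier(z)

# Joint objective: reconstruct the input and predict the class at the same time.
def joint_loss(x, x_hat, y, y_hat, alpha=0.5):
    recon = F.mse_loss(x_hat, x)
    clf = F.binary_cross_entropy(y_hat.squeeze(1), y.float())
    return alpha * recon + (1 - alpha) * clf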

Result: 86.5% accuracy

This method gave us an embedding space where the two classes could be separated to some extent:

Embedding Space Visualization - Autoencoder

Phase 3: 1D Convolutional Neural Network

📚 Inspiration

Inspired by the paper: "Identifying Exoplanets with Deep Learning: A CNN and RNN Classifier for Kepler DR25 and Candidate Vetting"

Bibin Thomas, Vittal Bhat M, Salman Arafath Mohammed, Abdul Wase Mohammed, Adis Abebaw Dessalegn, Mohit Mittal

📄 Read Paper on arXiv

Model Architecture

1D CNN Model created with 460,929 parameters

Conv1DClassifier(
  (conv_layers): Sequential(
    (0): Conv1d(1, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.01)
    (3): Dropout(p=0.2, inplace=False)
    (4): Conv1d(64, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (5): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): LeakyReLU(negative_slope=0.01)
    (7): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (8): Dropout(p=0.2, inplace=False)
    (9): Conv1d(128, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (10): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): LeakyReLU(negative_slope=0.01)
    (12): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (13): Dropout(p=0.2, inplace=False)
  )
  (fc_layers): Sequential(
    (0): Linear(in_features=2560, out_features=128, bias=True)
    (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.01)
    (3): Dropout(p=0.2, inplace=False)
    (4): Linear(in_features=128, out_features=64, bias=True)
    (5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): LeakyReLU(negative_slope=0.01)
    (7): Dropout(p=0.2, inplace=False)
    (8): Linear(in_features=64, out_features=1, bias=True)
    (9): Sigmoid()
  )
)
Best Result: 88.5% accuracy
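For context, here is a minimal training-loop sketch for a model like the one above; it assumes Conv1DClassifier can be built without arguments, that train_loader yields batches shaped (batch, 1, n_features), and an illustrative epoch count.

import torch
import torch.nn as nn

model = Conv1DClassifier()                       # architecture printed above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()                         # sigmoid output -> binary cross-entropy

for epoch in range(50):                          # illustrative number of epochs
    model.train()
    for X_batch, y_batch in train_loader:        # X_batch: (batch, 1, n_features)
        optimizer.zero_grad()
        y_hat = model(X_batch).squeeze(1)
        loss = criterion(y_hat, y_batch.float())
        loss.backward()
        optimizer.step()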

📈 Performance Visualizations

CNN Confusion Matrix - TESS

Models Comparison - TESS

Models Performance Heatmap - TESS

Conclusion

Our approach to exoplanet detection is based on careful feature engineering, solid data processing, and trying out different machine learning models until we found what worked best.

Each dataset had its own challenges, so we had to adapt our methods for each one:

  • Kepler: 98.42% accuracy with a tuned Logistic Regression model
  • K2: 99.13% precision by filtering out uncertain candidates
  • TESS: Improved from 77.7% to 88.5% by moving to more complex neural networks

These models are now running as APIs on Google Cloud Run, so anyone can use them for real-time exoplanet classification.
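As an illustration only, calling such a service from Python might look like the sketch below; the URL and payload schema are placeholders, not the project's actual API contract.

import requests

# Hypothetical Cloud Run endpoint and feature payload (both placeholders).
url = "https://<your-service>.a.run.app/predict"
payload = {"koi_period": 9.48, "koi_prad": 2.26, "koi_depth": 615.8}

response = requests.post(url, json=payload, timeout=30)
print(response.json())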

Explore the Code