Features
The ML Code Smell Detector checks for various code smells across different categories. Here’s a detailed breakdown of the smells it detects:
Framework-Specific Smells
General
Import Checker: Ensures standard naming conventions for imported modules (e.g., import numpy as np).
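As an illustration of what an import convention check involves, here is a minimal sketch built on Python's standard `ast` module. It is not the detector's actual implementation: the `EXPECTED_ALIASES` table and the `find_unconventional_imports` function are hypothetical names made up for this example.

```python
import ast

# Assumed convention table; the detector's real list is not given in this README.
EXPECTED_ALIASES = {
    "numpy": "np",
    "pandas": "pd",
    "tensorflow": "tf",
    "matplotlib.pyplot": "plt",
}


def find_unconventional_imports(source):
    """Return one message per `import X` statement that skips the expected alias."""
    messages = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                expected = EXPECTED_ALIASES.get(alias.name)
                if expected is not None and alias.asname != expected:
                    messages.append(
                        f"line {node.lineno}: import {alias.name} "
                        f"should use alias '{expected}'"
                    )
    return messages


sample = "import numpy as np\nimport pandas\nimport tensorflow as tflow\n"
issues = find_unconventional_imports(sample)
for message in issues:
    print(message)
```

The sketch only inspects `import X [as Y]` statements; `from X import Y` forms are left alone.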
Pandas
Unnecessary Iteration: Detects use of .iterrows(), which is usually much slower than vectorized operations.
DataFrame Iteration Modification: Identifies modifications to DataFrames during iteration, which can lead to unexpected behavior.
Chain Indexing: Detects chained indexing such as df[mask][col], which can silently assign to a temporary copy and is slower than a single .loc call.
Datatype Checker: Ensures explicit data type setting when importing data to prevent automatic type inference issues.
Column Selection Checker: Encourages selecting necessary columns after importing DataFrames for clarity and performance.
Merge Parameter Checker: Checks for proper use of parameters in merge operations to prevent data loss.
InPlace Checker: Discourages use of inplace=True to prevent accidental data loss.
DataFrame Conversion Checker: Encourages use of .to_numpy() instead of .values for future compatibility.
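The Pandas smells above are easier to recognize next to their usual replacements. The following is a small sketch on toy data; the column names and values are invented for the example, and the detector itself may match different patterns.

```python
import io

import pandas as pd

# Datatype and Column Selection: state dtypes and needed columns at import time.
csv_text = "id,price,qty,note\n001,10.0,1,x\n002,20.0,2,y\n003,30.0,3,z\n"
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": str, "price": "float64", "qty": "int64"},
    usecols=["id", "price", "qty"],
)

# Unnecessary Iteration: a row-by-row loop versus one vectorized expression.
totals_loop = [row["price"] * row["qty"] for _, row in df.iterrows()]
totals_vectorized = df["price"] * df["qty"]

# Chain Indexing: df[df["qty"] > 1]["price"] = 0 may write to a temporary copy.
# A single .loc call selects the rows and the column in one step.
discounted = df.copy()
discounted.loc[discounted["qty"] > 1, "price"] = 0.0

# InPlace: reassign the result instead of passing inplace=True.
renamed = df.rename(columns={"qty": "quantity"})

# DataFrame Conversion: .to_numpy() instead of .values.
matrix = df[["price", "qty"]].to_numpy()

# Merge Parameter: spell out how, on, and validate so surprises raise early.
labels = pd.DataFrame({"id": ["001", "002", "003"], "label": ["a", "b", "c"]})
merged = df.merge(labels, how="left", on="id", validate="one_to_one")
```

Reading `id` as a string keeps the leading zeros that automatic type inference would drop, which is the kind of issue the Datatype Checker is meant to prevent.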
NumPy
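NaN Equality Checker: Detects direct NaN comparisons such as x == np.nan, which is always False because NaN is not equal to anything, including itself, and suggests using np.isnan() instead.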
Randomness Control Checker: Checks for proper random seed setting for reproducibility.
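Both NumPy smells can be shown in a few lines. This sketch uses a seeded `np.random.default_rng` generator for the reproducibility part; the README does not say which seeding call the checker looks for, so treat that choice (and the seed value 42) as an assumption. The legacy `np.random.seed` call is the other common option.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# NaN never compares equal to anything, including itself,
# so this mask is all False even though one element is NaN.
wrong_mask = values == np.nan

# np.isnan() tests each element for NaN correctly.
right_mask = np.isnan(values)

# Randomness control: two generators built from the same seed
# produce identical draws, so results can be reproduced across runs.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)
same_draws = np.array_equal(rng_a.random(3), rng_b.random(3))
```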
Scikit-learn
Scaler Missing Checker: Ensures scaling is applied before scale-sensitive operations.
Pipeline Checker: Encourages use of Pipelines to prevent data leakage.
Cross Validation Checker: Checks for proper use of cross-validation techniques.
Randomness Control Checker: Ensures consistent random state setting across estimators.
Verbose Mode Checker: Encourages use of verbose mode for long-running processes.
Dependent Threshold Checker: Suggests use of threshold-independent metrics alongside threshold-dependent ones.
Unit Testing Checker: Checks for presence of unit tests.
Data Leakage Checker: Ensures proper train-test splitting to prevent data leakage.
Exception Handling Checker: Checks for proper exception handling in data processing steps.
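Several of the Scikit-learn checks describe one workflow: split first, keep scaling inside a Pipeline, fix `random_state`, cross-validate, and report a threshold-independent metric next to a threshold-dependent one. Below is a minimal sketch of that workflow on synthetic data. The dataset, the choice of estimator, and the `SEED` value are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # hypothetical seed, reused everywhere for consistent randomness

X, y = make_classification(n_samples=200, n_features=5, random_state=SEED)

# Data Leakage: split before any fitting so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED, stratify=y
)

# Scaler Missing / Pipeline: the scaler sits inside the pipeline, so during
# cross-validation it is refit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=SEED))

# Cross Validation: five folds on the training portion.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Dependent Threshold: accuracy depends on the 0.5 cut-off, ROC AUC does not.
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```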
TensorFlow
Randomness Control Checker: Checks for proper random seed setting.
Early Stopping Checker: Encourages use of early stopping to prevent overfitting.
Checkpointing Checker: Ensures model checkpoints are saved during training.
Memory Release Checker: Checks for proper memory clearing, especially in loops.
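Mask Missing Checker: Ensures inputs to operations such as tf.math.log are masked or clipped so that invalid values (for example, zero or negative numbers) do not produce NaN or infinite results.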
Tensor Array Checker: Encourages use of tf.TensorArray for dynamic tensor lists.
Dependent Threshold Checker: Similar to Scikit-learn’s checker.
Logging Checker: Encourages use of TensorBoard or other logging mechanisms.
Batch Normalisation Checker: Checks for use of batch normalization layers.
Dropout Usage Checker: Encourages use of dropout for regularization.
Data Augmentation Checker: Checks for data augmentation techniques.
Learning Rate Scheduler Checker: Encourages use of learning rate schedules.
Model Evaluation Checker: Ensures proper model evaluation practices.
Unit Testing Checker: Checks for TensorFlow-specific unit tests.
Exception Handling Checker: Similar to Scikit-learn’s checker.
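Many of the TensorFlow checks amount to asking whether a script calls a particular API at all. The sketch below shows one way such presence checks could be approximated with the standard `ast` module. It is a guess at the general approach, not the detector's code: the helper names are hypothetical, and it assumes TensorFlow is imported under the alias `tf`.

```python
import ast


def dotted_name(node):
    """Rebuild 'a.b.c' from nested Attribute/Name nodes, or return None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def called_names(source):
    """Collect the dotted names of every function called in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None:
                names.add(name)
    return names


def check_tensorflow_script(source):
    """Report which of three practices from the list above appear to be missing."""
    calls = called_names(source)
    missing = []
    if "tf.random.set_seed" not in calls:
        missing.append("no call to tf.random.set_seed")
    if not any(name.endswith("EarlyStopping") for name in calls):
        missing.append("no EarlyStopping callback")
    if not any(name.endswith("ModelCheckpoint") for name in calls):
        missing.append("no ModelCheckpoint callback")
    return missing


# The script is only parsed, never executed, so TensorFlow is not required here.
script = """
import tensorflow as tf
tf.random.set_seed(42)
stop = tf.keras.callbacks.EarlyStopping(patience=3)
model.fit(x, y, callbacks=[stop])
"""
missing = check_tensorflow_script(script)
```

A name-based check like this cannot tell whether the callback is actually passed to `fit`, so a real checker would likely need more context than this sketch uses.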
PyTorch
Randomness Control Checker: Checks for proper random seed setting.
Deterministic Algorithm Usage Checker: Encourages use of deterministic algorithms.
Randomness Control Checker (PyTorch-Dataloader): Checks for proper random seed setting in DataLoader.
Mask Missing Checker: Similar to TensorFlow’s checker.
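Net Forward Checker: Discourages direct calls to net.forward(); calling the module itself (net(x)) also runs any registered hooks.
Gradient Clear Checker: Ensures gradients are cleared (for example, with optimizer.zero_grad()) before each backward pass.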
Batch Normalisation Checker: Similar to TensorFlow’s checker.
Dropout Usage Checker: Similar to TensorFlow’s checker.
Data Augmentation Checker: Checks for use of torchvision transforms.
Learning Rate Scheduler Checker: Similar to TensorFlow’s checker.
Logging Checker: Checks for use of tensorboardX or similar logging tools.
Model Evaluation Checker: Ensures model is set to evaluation mode when appropriate.
Unit Testing Checker: Similar to Scikit-learn’s checker.
Exception Handling Checker: Similar to Scikit-learn’s checker.
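The Net Forward and Gradient Clear checks are structural, so they lend themselves to a syntax-tree illustration. The following sketch uses the standard `ast` module to flag both patterns in a training loop. It is an assumption about how such checks might work, with a hypothetical function name, and is not taken from the detector's source.

```python
import ast


def find_pytorch_smells(source):
    """Return sorted (line, message) pairs for two patterns from the list above."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Net Forward: any call of the form <something>.forward(...).
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "forward"
        ):
            findings.append((node.lineno, "call the module instead of .forward()"))

        # Gradient Clear: a loop that calls .backward() but never .zero_grad().
        if isinstance(node, (ast.For, ast.While)):
            methods = {
                inner.func.attr
                for inner in ast.walk(node)
                if isinstance(inner, ast.Call)
                and isinstance(inner.func, ast.Attribute)
            }
            if "backward" in methods and "zero_grad" not in methods:
                findings.append((node.lineno, "backward() without zero_grad()"))
    return sorted(findings)


# Only parsed, never executed, so PyTorch is not required to run the sketch.
smelly_loop = """
for batch, target in loader:
    output = net.forward(batch)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
"""

clean_loop = """
for batch, target in loader:
    optimizer.zero_grad()
    output = net(batch)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
"""

smelly_findings = find_pytorch_smells(smelly_loop)
clean_findings = find_pytorch_smells(clean_loop)
```

The forward rule in this sketch would also flag legitimate calls such as `super().forward(x)` inside a subclass, which a production checker would need to exclude.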
General ML Smells
Data Leakage: Checks for potential data leakage issues.
Magic Numbers: Identifies hard-coded constants that should be named variables.
Feature Scaling: Ensures consistent feature scaling across the dataset.
Cross Validation: Checks for proper use of cross-validation techniques.
Imbalanced Dataset Handling: Checks whether techniques for handling imbalanced datasets are used.
Feature Selection: Checks if feature selection is applied with proper validation.
Metric Selection: Ensures use of appropriate evaluation metrics.
Model Persistence: Checks for proper model saving practices.
Reproducibility: Ensures random seeds are set for reproducibility.
Data Loading: Suggests efficient data loading practices for large datasets.
Unused Features: Identifies potentially unused features.
Overfit-Prone Practices: Checks for practices that might lead to overfitting.
Error Handling: Ensures proper error handling in data processing.
Hardcoded Filepaths: Identifies hardcoded file paths.
Documentation: Checks for presence of docstrings and comments.
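Magic Numbers and Hardcoded Filepaths are both about literals embedded in code, so a single pass over the syntax tree can illustrate them together. This is a rough sketch using the standard `ast` module. The allowed-number set, the path prefixes, and the rule that UPPER_CASE assignments count as named constants are all assumptions for the example, not documented behavior of the detector.

```python
import ast

ALLOWED_NUMBERS = {0, 1}            # assumed whitelist of self-explanatory values
PATH_PREFIXES = ("/", "~/", "./")   # assumed markers of a literal file path


def find_literal_smells(source):
    """Return sorted (line, message) pairs for magic numbers and literal paths."""
    tree = ast.parse(source)

    # Numbers assigned directly to UPPER_CASE names are treated as named constants.
    named_constants = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            if all(isinstance(t, ast.Name) and t.id.isupper() for t in node.targets):
                named_constants.add(id(node.value))

    findings = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Constant):
            continue
        value = node.value
        if isinstance(value, bool):
            continue  # True/False are ints in Python but are not magic numbers
        if isinstance(value, (int, float)):
            if value not in ALLOWED_NUMBERS and id(node) not in named_constants:
                findings.append((node.lineno, f"magic number {value!r}"))
        elif isinstance(value, str) and value.startswith(PATH_PREFIXES):
            findings.append((node.lineno, f"hardcoded path {value!r}"))
    return sorted(findings)


# Hypothetical script under inspection; it is parsed, not executed.
script = """
LEARNING_RATE = 0.01
data = load("/tmp/train.csv")
model = fit(data, epochs=50, lr=LEARNING_RATE)
"""
findings = find_literal_smells(script)
```

Here `0.01` passes because it is bound to a named constant, while the literal path and the bare `50` are reported.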
Hugging Face-Specific Smells
Model Versioning: Ensures specific model versions are used for reproducibility.
Tokenizer Caching: Checks if tokenizers are cached to avoid re-downloading.
Model Caching: Checks if models are cached to avoid re-downloading.
Deterministic Tokenization: Ensures consistent tokenization settings.
Efficient Data Loading: Encourages use of efficient data loading techniques.
Distributed Training: Checks for configuration of distributed training.
Mixed Precision Training: Encourages use of mixed precision training for performance.
Gradient Accumulation: Checks whether gradient accumulation is used to reach large effective batch sizes.
Learning Rate Scheduling: Ensures use of learning rate schedulers.
Early Stopping: Checks for implementation of early stopping.
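For Model Versioning, one plausible signal is a `from_pretrained` call that does not pass the `revision` argument. The sketch below looks for that pattern with the standard `ast` module. Whether the detector uses this exact rule is an assumption; the function name is hypothetical, and `<commit-sha>` in the sample is a placeholder rather than a real revision.

```python
import ast


def find_unpinned_models(source):
    """Return line numbers of from_pretrained(...) calls without revision=..."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "from_pretrained"
        ):
            keyword_names = {kw.arg for kw in node.keywords}
            if "revision" not in keyword_names:
                lines.append(node.lineno)
    return sorted(lines)


# Parsed only, so the transformers package is not needed to run this sketch.
script = """
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", revision="<commit-sha>")
"""
unpinned_lines = find_unpinned_models(script)
```

A call that forwards `revision` through `**kwargs` would be reported as unpinned by this sketch, since the keyword is not visible in the syntax tree.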