Features

The ML Code Smell Detector checks for various code smells across different categories. Here’s a detailed breakdown of the smells it detects:

Framework-Specific Smells

General

  1. Import Checker: Ensures standard naming conventions for imported modules (e.g., import numpy as np).

Pandas

  1. Unnecessary Iteration: Detects use of .iterrows() which is often slower than vectorized operations.

  2. DataFrame Iteration Modification: Identifies modifications to DataFrames during iteration, which can lead to unexpected behavior.

  3. Chain Indexing: Detects chained indexing, which can lead to performance issues and unexpected behavior.

  4. Datatype Checker: Ensures explicit data type setting when importing data to prevent automatic type inference issues.

  5. Column Selection Checker: Encourages selecting necessary columns after importing DataFrames for clarity and performance.

  6. Merge Parameter Checker: Checks for proper use of parameters in merge operations to prevent data loss.

  7. InPlace Checker: Discourages use of inplace=True to prevent accidental data loss.

  8. DataFrame Conversion Checker: Encourages use of .to_numpy() instead of .values for future compatibility.

NumPy

  1. NaN Equality Checker: Detects improper NaN comparisons and suggests using np.isnan().

  2. Randomness Control Checker: Checks for proper random seed setting for reproducibility.

Scikit-learn

  1. Scaler Missing Checker: Ensures scaling is applied before scale-sensitive operations.

  2. Pipeline Checker: Encourages use of Pipelines to prevent data leakage.

  3. Cross Validation Checker: Checks for proper use of cross-validation techniques.

  4. Randomness Control Checker: Ensures consistent random state setting across estimators.

  5. Verbose Mode Checker: Encourages use of verbose mode for long-running processes.

  6. Dependent Threshold Checker: Suggests use of threshold-independent metrics alongside threshold-dependent ones.

  7. Unit Testing Checker: Checks for presence of unit tests.

  8. Data Leakage Checker: Ensures proper train-test splitting to prevent data leakage.

  9. Exception Handling Checker: Checks for proper exception handling in data processing steps.

TensorFlow

  1. Randomness Control Checker: Checks for proper random seed setting.

  2. Early Stopping Checker: Encourages use of early stopping to prevent overfitting.

  3. Checkpointing Checker: Ensures model checkpoints are saved during training.

  4. Memory Release Checker: Checks for proper memory clearing, especially in loops.

  5. Mask Missing Checker: Ensures proper masking in operations like tf.math.log.

  6. Tensor Array Checker: Encourages use of tf.TensorArray for dynamic tensor lists.

  7. Dependent Threshold Checker: Similar to Scikit-learn’s checker.

  8. Logging Checker: Encourages use of TensorBoard or other logging mechanisms.

  9. Batch Normalisation Checker: Checks for use of batch normalization layers.

  10. Dropout Usage Checker: Encourages use of dropout for regularization.

  11. Data Augmentation Checker: Checks for data augmentation techniques.

  12. Learning Rate Scheduler Checker: Encourages use of learning rate schedules.

  13. Model Evaluation Checker: Ensures proper model evaluation practices.

  14. Unit Testing Checker: Checks for TensorFlow-specific unit tests.

  15. Exception Handling Checker: Similar to Scikit-learn’s checker.

PyTorch

  1. Randomness Control Checker: Checks for proper random seed setting.

  2. Deterministic Algorithm Usage Checker: Encourages use of deterministic algorithms.

  3. Randomness Control Checker (PyTorch-Dataloader): Checks for proper random seed setting in DataLoader.

  4. Mask Missing Checker: Similar to TensorFlow’s checker.

  5. Net Forward Checker: Discourages direct calls to net.forward().

  6. Gradient Clear Checker: Ensures gradients are cleared before each backward pass.

  7. Batch Normalisation Checker: Similar to TensorFlow’s checker.

  8. Dropout Usage Checker: Similar to TensorFlow’s checker.

  9. Data Augmentation Checker: Checks for use of torchvision transforms.

  10. Learning Rate Scheduler Checker: Similar to TensorFlow’s checker.

  11. Logging Checker: Checks for use of tensorboardX or similar logging tools.

  12. Model Evaluation Checker: Ensures model is set to evaluation mode when appropriate.

  13. Unit Testing Checker: Similar to Scikit-learn’s checker.

  14. Exception Handling Checker: Similar to Scikit-learn’s checker.

General ML Smells

  1. Data Leakage: Checks for potential data leakage issues.

  2. Magic Numbers: Identifies hard-coded constants that should be named variables.

  3. Feature Scaling: Ensures consistent feature scaling across the dataset.

  4. Cross Validation: Checks for proper use of cross-validation techniques.

  5. Imbalanced Dataset Handling: Identifies if techniques for handling imbalanced datasets are used.

  6. Feature Selection: Checks if feature selection is applied with proper validation.

  7. Metric Selection: Ensures use of appropriate evaluation metrics.

  8. Model Persistence: Checks for proper model saving practices.

  9. Reproducibility: Ensures random seeds are set for reproducibility.

  10. Data Loading: Suggests efficient data loading practices for large datasets.

  11. Unused Features: Identifies potentially unused features.

  12. Overfit-Prone Practices: Checks for practices that might lead to overfitting.

  13. Error Handling: Ensures proper error handling in data processing.

  14. Hardcoded Filepaths: Identifies hardcoded file paths.

  15. Documentation: Checks for presence of docstrings and comments.

Hugging Face-Specific Smells

  1. Model Versioning: Ensures specific model versions are used for reproducibility.

  2. Tokenizer Caching: Checks if tokenizers are cached to avoid re-downloading.

  3. Model Caching: Checks if models are cached to avoid re-downloading.

  4. Deterministic Tokenization: Ensures consistent tokenization settings.

  5. Efficient Data Loading: Encourages use of efficient data loading techniques.

  6. Distributed Training: Checks for configuration of distributed training.

  7. Mixed Precision Training: Encourages use of mixed precision training for performance.

  8. Gradient Accumulation: Checks for gradient accumulation for large batch sizes.

  9. Learning Rate Scheduling: Ensures use of learning rate schedulers.

  10. Early Stopping: Checks for implementation of early stopping.