rmontanana/BayesNet

Fork 0

Files

Ricardo Montañana Gómez 090172c6c5

Add Claude local discretization analysis

2025-07-04 12:19:58 +02:00

9.3 KiB

Raw Permalink Blame History

Local Discretization Analysis - BayesNet Library

Overview

This document analyzes the local discretization implementation in the BayesNet library, specifically focusing on the Proposal.cc implementation, and evaluates the feasibility of implementing an iterative discretization approach.

Current Local Discretization Implementation

Core Architecture

The local discretization functionality is implemented through a Proposal class (bayesnet/classifiers/Proposal.h) that serves as a mixin/base class for creating "Ld" (Local Discretization) variants of existing classifiers.

Key Components

1. The Proposal Class

Purpose: Handles continuous data by applying local discretization using discretization algorithms
Dependencies: Uses the fimdlp library for discretization algorithms
Supported Algorithms:
- MDLP (Minimum Description Length Principle) - Default
- BINQ - Quantile-based binning
- BINU - Uniform binning

2. Local Discretization Variants

The codebase implements Ld variants using multiple inheritance:

Individual Classifiers:

TANLd - Tree Augmented Naive Bayes with Local Discretization
KDBLd - K-Dependence Bayesian with Local Discretization
SPODELd - Super-Parent One-Dependence Estimator with Local Discretization

Ensemble Classifiers:

AODELd - Averaged One-Dependence Estimator with Local Discretization

Implementation Pattern

All Ld variants follow a consistent pattern using multiple inheritance:

class TANLd : public TAN, public Proposal {
    // Inherits from both the base classifier and Proposal
};

Two-Phase Discretization Process

Phase 1: Initial Discretization (`fit_local_discretization`)

Each continuous feature is discretized independently using the chosen algorithm
Creates initial discrete dataset
Uses only class labels for discretization decisions

Phase 2: Network-Aware Refinement (`localDiscretizationProposal`)

After building the initial Bayesian network structure
Features are re-discretized considering their parent nodes in the network
Uses topological ordering to ensure proper dependency handling
Creates more informed discretization boundaries based on network relationships

Hyperparameter Support

The Proposal class supports several configurable hyperparameters:

ld_algorithm: Choice of discretization algorithm (MDLP, BINQ, BINU)
ld_proposed_cuts: Number of proposed cuts for discretization
mdlp_min_length: Minimum interval length for MDLP
mdlp_max_depth: Maximum depth for MDLP tree

Current Implementation Strengths

Sophisticated Approach: Considers network structure in discretization decisions
Modular Design: Clean separation through Proposal class mixin
Multiple Algorithm Support: Flexible discretization strategies
Proper Dependency Handling: Topological ordering ensures correct processing
Well-Integrated: Seamless integration with existing classifier architecture

Areas for Improvement

Code Quality Issues

Dead Code: Line 161 in Proposal.cc contains unused variable allDigits
Performance Issues:
- String concatenation in tight loop (lines 82-84) using += operator
- Memory allocations could be optimized
- Tensor operations could be batched better
Error Handling: Could be more robust with better exception handling

Algorithm Clarity

Logic Clarity: The upgrade flag logic could be more descriptive
Variable Naming: Some variables need more descriptive names
Documentation: Better inline documentation of the two-phase process
Method Complexity: localDiscretizationProposal method is quite long and complex

Suggested Code Improvements

// Instead of string concatenation in loop:
for (auto idx : indices) {
    for (int i = 0; i < Xf.size(1); ++i) {
        yJoinParents[i] += to_string(pDataset.index({ idx, i }).item<int>());
    }
}

// Consider using stringstream or pre-allocation:
std::stringstream ss;
for (auto idx : indices) {
    for (int i = 0; i < Xf.size(1); ++i) {
        ss << pDataset.index({ idx, i }).item<int>();
        yJoinParents[i] = ss.str();
        ss.str("");
    }
}

Iterative Discretization Proposal

Concept

Implement an iterative process: discretize → build model → re-discretize → rebuild model → repeat until convergence.

Feasibility Assessment

Highly Feasible - The current implementation already provides a solid foundation with its two-phase approach, making extension straightforward.

Proposed Implementation Strategy

class IterativeProposal : public Proposal {
public:
    struct ConvergenceParams {
        int max_iterations = 10;
        double tolerance = 1e-6;
        bool check_network_structure = true;
        bool check_discretization_stability = true;
    };

private:
    map<string, vector<int>> iterativeLocalDiscretization(const torch::Tensor& y) {
        auto states = fit_local_discretization(y);  // Initial discretization
        Network previousModel, currentModel;
        int iteration = 0;
        
        do {
            previousModel = currentModel;
            
            // Build model with current discretization
            const torch::Tensor weights = torch::full({ pDataset.size(1) }, 1.0 / pDataset.size(1), torch::kDouble);
            currentModel.fit(pDataset, weights, pFeatures, pClassName, states, Smoothing_t::ORIGINAL);
            
            // Apply local discretization based on current model
            auto newStates = localDiscretizationProposal(states, currentModel);
            
            // Check for convergence
            if (hasConverged(previousModel, currentModel, states, newStates)) {
                break;
            }
            
            states = newStates;
            iteration++;
            
        } while (iteration < convergenceParams.max_iterations);
        
        return states;
    }
    
    bool hasConverged(const Network& prev, const Network& curr, 
                     const map<string, vector<int>>& oldStates,
                     const map<string, vector<int>>& newStates) {
        // Implementation of convergence criteria
        return checkNetworkStructureConvergence(prev, curr) && 
               checkDiscretizationStability(oldStates, newStates);
    }
};

Convergence Criteria Options

Network Structure Comparison: Compare edge sets between iterations

bool checkNetworkStructureConvergence(const Network& prev, const Network& curr) {
    // Compare adjacency matrices or edge lists
    return prev.getEdges() == curr.getEdges();
}

Discretization Stability: Check if cut points change significantly

bool checkDiscretizationStability(const map<string, vector<int>>& oldStates,
                                 const map<string, vector<int>>& newStates) {
    for (const auto& [feature, states] : oldStates) {
        if (states != newStates.at(feature)) {
            return false;
        }
    }
    return true;
}

Performance Metrics: Monitor accuracy/likelihood convergence
Maximum Iterations: Prevent infinite loops

Expected Benefits

Better Discretization Quality: Each iteration refines boundaries based on learned dependencies
Improved Model Accuracy: More informed discretization leads to better classification
Adaptive Process: Automatically finds optimal discretization-model combination
Principled Approach: Theoretically sound iterative refinement
Reduced Manual Tuning: Less need for hyperparameter optimization

Implementation Considerations

Convergence Detection: Need robust criteria to detect when to stop
Performance Impact: Multiple iterations increase computational cost
Overfitting Prevention: May need regularization to avoid over-discretization
Stability Guarantees: Ensure the process doesn't oscillate indefinitely
Memory Management: Handle multiple model instances efficiently

Integration Strategy

Backward Compatibility: Keep existing two-phase approach as default
Optional Feature: Add iterative mode as configurable option
Hyperparameter Extension: Add convergence-related parameters
Testing Framework: Comprehensive testing on standard datasets

Conclusion

The current local discretization implementation in BayesNet is well-designed and functional, providing a solid foundation for the proposed iterative enhancement. The iterative approach would significantly improve the quality of discretization by creating a feedback loop between model structure and discretization decisions.

The implementation is highly feasible given the existing architecture, and the expected benefits justify the additional computational complexity. The key to success will be implementing robust convergence criteria and maintaining the modularity of the current design.

Recommendations

Immediate Improvements: Fix code quality issues and optimize performance bottlenecks
Iterative Implementation: Develop the iterative approach as an optional enhancement
Comprehensive Testing: Validate improvements on standard benchmark datasets
Documentation: Enhance inline documentation and user guides
Performance Profiling: Monitor computational overhead and optimize where necessary

9.3 KiB Raw Permalink Blame History