diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..25b6761 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,83 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Project Overview + +ArffFiles is a header-only C++ library for reading ARFF (Attribute-Relation File Format) files and converting them into STL vectors. The library handles both numeric and categorical features, automatically factorizing categorical attributes. + +## Build System + +This project uses CMake with Conan for package management: +- **CMake**: Primary build system (requires CMake 3.20+) +- **Conan**: Package management for dependencies +- **Makefile**: Convenience wrapper for common tasks + +## Common Development Commands + +### Building and Testing +```bash +# Build and run tests (recommended) +make build && make test + +# Alternative manual build process +mkdir build_debug +cmake -S . -B build_debug -D CMAKE_BUILD_TYPE=Debug -D ENABLE_TESTING=ON -D CODE_COVERAGE=ON +cmake --build build_debug -t unit_tests_arffFiles -j 16 +cd build_debug/tests && ./unit_tests_arffFiles +``` + +### Testing Options +```bash +# Run tests with verbose output +make test opt="-s" + +# Clean test artifacts +make clean +``` + +### Code Coverage +Code coverage is enabled when building with `-D CODE_COVERAGE=ON` and `-D ENABLE_TESTING=ON`. Coverage reports are generated during test runs. + +## Architecture + +### Core Components + +**Single Header Library**: `ArffFiles.hpp` contains the complete implementation. 
+ +**Main Class**: `ArffFiles` +- Header-only design for easy integration +- Handles ARFF file parsing and data conversion +- Automatically determines numeric vs categorical features +- Supports flexible class attribute positioning + +### Key Methods +- `load(fileName, classLast=true)`: Load with class attribute at end/beginning +- `load(fileName, className)`: Load with specific named class attribute +- `getX()`: Returns feature vectors as `std::vector<std::vector<float>>` +- `getY()`: Returns labels as `std::vector<int>` +- `getNumericAttributes()`: Returns feature type mapping + +### Data Processing Pipeline +1. **File Parsing**: Reads ARFF format, extracts attributes and data +2. **Feature Detection**: Automatically identifies numeric vs categorical attributes +3. **Preprocessing**: Handles missing values (lines with '?' are skipped) +4. **Factorization**: Converts categorical features to numeric codes +5. **Dataset Generation**: Creates final X (features) and y (labels) vectors + +### Dependencies +- **Catch2**: Testing framework (fetched via CMake FetchContent) +- **Standard Library**: Uses STL containers (vector, map, string) +- **C++17**: Minimum required standard + +### Test Structure +- Tests located in `tests/` directory +- Sample ARFF files in `tests/data/` +- Single test executable: `unit_tests_arffFiles` +- Uses Catch2 v3.3.2 for test framework + +### Conan Integration +The project includes a `conanfile.py` that: +- Automatically extracts version from CMakeLists.txt +- Packages as a header-only library +- Exports only the main header file \ No newline at end of file diff --git a/README.md b/README.md index 79d70f9..9275e1c 100644 --- a/README.md +++ b/README.md @@ -5,10 +5,207 @@ ![Gitea Release](https://img.shields.io/gitea/v/release/rmontanana/arfffiles?gitea_url=https://gitea.rmontanana.es:3000) ![Gitea Last Commit](https://img.shields.io/gitea/last-commit/rmontanana/arfffiles?gitea_url=https://gitea.rmontanana.es:3000&logo=gitea) -Header-only library to read Arff Files
and return STL vectors with the data read. +A modern C++17 header-only library to read **ARFF (Attribute-Relation File Format)** files and convert them into STL vectors for machine learning and data analysis applications. -### Tests +## Features + +- 🔧 **Header-only**: Simply include `ArffFiles.hpp` - no compilation required +- 🚀 **Modern C++17**: Clean, efficient implementation using modern C++ standards +- 🔄 **Automatic Type Detection**: Distinguishes between numeric and categorical attributes +- 📊 **Flexible Class Positioning**: Support for class attributes at any position +- 🎯 **STL Integration**: Returns standard `std::vector` containers for seamless integration +- 🧹 **Data Cleaning**: Automatically handles missing values (lines with '?' are skipped) +- 🏷️ **Label Encoding**: Automatic factorization of categorical features into numeric codes + +## Requirements + +- **C++17** compatible compiler +- **Standard Library**: Uses STL containers (no external dependencies) + +## Installation + +### Using Conan ```bash -make build && make test +# Add the package to your conanfile.txt +[requires] +arff-files/1.0.1 + +# Or install directly +conan install arff-files/1.0.1@ ``` + +### Manual Installation + +Simply download `ArffFiles.hpp` and include it in your project: + +```cpp +#include "ArffFiles.hpp" +``` + +## Quick Start + +```cpp +#include "ArffFiles.hpp" +#include <iostream> + +int main() { + ArffFiles arff; + + // Load ARFF file (class attribute at the end by default) + arff.load("dataset.arff"); + + // Get feature matrix and labels + auto& X = arff.getX(); // std::vector<std::vector<float>> + auto& y = arff.getY(); // std::vector<int> + + std::cout << "Dataset size: " << arff.getSize() << " samples" << std::endl; + std::cout << "Features: " << X.size() << std::endl; + std::cout << "Classes: " << arff.getLabels().size() << std::endl; + + return 0; +} +``` + +## API Reference + +### Loading Data + +```cpp +// Load with class attribute at the end (default)
+arff.load("dataset.arff"); + +// Load with class attribute at the beginning +arff.load("dataset.arff", false); + +// Load with specific named class attribute +arff.load("dataset.arff", "class_name"); +``` + +### Accessing Data + +```cpp +// Get feature matrix (each inner vector is a feature, not a sample) +std::vector<std::vector<float>>& X = arff.getX(); + +// Get labels (encoded as integers) +std::vector<int>& y = arff.getY(); + +// Get dataset information +std::string className = arff.getClassName(); +std::vector<std::string> labels = arff.getLabels(); +unsigned long size = arff.getSize(); + +// Get attribute information +auto attributes = arff.getAttributes(); // std::vector<std::pair<std::string, std::string>> +auto numericFeatures = arff.getNumericAttributes(); // std::map<std::string, bool> +``` + +### Utility Methods + +```cpp +// Get library version +std::string version = arff.version(); + +// Access raw lines (after preprocessing) +std::vector<std::string> lines = arff.getLines(); + +// Get label states mapping +auto states = arff.getStates(); // std::map<std::string, std::vector<std::string>> +``` + +## Data Processing Pipeline + +1. **File Parsing**: Reads ARFF format, extracts `@attribute` declarations and data +2. **Missing Value Handling**: Skips lines containing `?` (missing values) +3. **Feature Type Detection**: Automatically identifies `REAL`, `INTEGER`, `NUMERIC` vs categorical +4. **Label Positioning**: Handles class attributes at any position in the data +5. **Factorization**: Converts categorical features and labels to numeric codes +6.
**Data Organization**: Creates feature matrix `X` and label vector `y` + +## Example: Complete Workflow + +```cpp +#include "ArffFiles.hpp" +#include <iostream> + +int main() { + try { + ArffFiles arff; + arff.load("iris.arff"); + + // Display dataset information + std::cout << "Dataset: " << arff.getClassName() << std::endl; + std::cout << "Samples: " << arff.getSize() << std::endl; + std::cout << "Features: " << arff.getX().size() << std::endl; + + // Show class labels + auto labels = arff.getLabels(); + std::cout << "Classes: "; + for (const auto& label : labels) { + std::cout << label << " "; + } + std::cout << std::endl; + + // Show which features are numeric + auto numericFeatures = arff.getNumericAttributes(); + for (const auto& [feature, isNumeric] : numericFeatures) { + std::cout << feature << ": " << (isNumeric ? "numeric" : "categorical") << std::endl; + } + + } catch (const std::exception& e) { + std::cerr << "Error: " << e.what() << std::endl; + return 1; + } + + return 0; +} +``` + +## Supported ARFF Features + +- ✅ Numeric attributes (`@attribute feature REAL/INTEGER/NUMERIC`) +- ✅ Categorical attributes (`@attribute feature {value1,value2,...}`) +- ✅ Comments (lines starting with `%`) +- ✅ Missing values (automatic skipping of lines with `?`) +- ✅ Flexible class attribute positioning +- ✅ Case-insensitive attribute declarations + +## Error Handling + +The library throws `std::invalid_argument` exceptions for: +- Unable to open file +- No attributes found in file +- Specified class name not found + +## Development + +### Building and Testing + +```bash +# Build and run tests +make build && make test + +# Run tests with verbose output +make test opt="-s" + +# Clean test artifacts +make clean +``` + +### Using CMake Directly + +```bash +mkdir build_debug +cmake -S .
-B build_debug -D CMAKE_BUILD_TYPE=Debug -D ENABLE_TESTING=ON +cmake --build build_debug -t unit_tests_arffFiles +cd build_debug/tests && ./unit_tests_arffFiles +``` + +## License + +This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. + +## Contributing + +Contributions are welcome! Please feel free to submit a Pull Request. diff --git a/TECHNICAL_REPORT.md b/TECHNICAL_REPORT.md new file mode 100644 index 0000000..94b2dba --- /dev/null +++ b/TECHNICAL_REPORT.md @@ -0,0 +1,242 @@ +# ArffFiles Library - Technical Analysis Report + +**Generated**: 2025-06-27 +**Version Analyzed**: 1.1.0 +**Library Type**: Header-only C++17 ARFF File Parser + +## Executive Summary + +The ArffFiles library is a functional header-only C++17 implementation for parsing ARFF (Attribute-Relation File Format) files. While it successfully accomplishes its core purpose, several significant weaknesses in design, performance, and robustness have been identified that could impact production use. + +**Overall Assessment**: ⚠️ **MODERATE RISK** - Functional but requires improvements for production use. + +--- + +## 🟢 Strengths + +### 1. **Architectural Design** +- ✅ **Header-only**: Easy integration, no compilation dependencies +- ✅ **Modern C++17**: Uses appropriate standard library features +- ✅ **Clear separation**: Public/protected/private access levels well-defined +- ✅ **STL Integration**: Returns standard containers for seamless integration + +### 2. **Functionality** +- ✅ **Flexible class positioning**: Supports class attributes at any position +- ✅ **Automatic type detection**: Distinguishes numeric vs categorical attributes +- ✅ **Missing value handling**: Skips lines with '?' characters +- ✅ **Label encoding**: Automatic factorization of categorical features +- ✅ **Case-insensitive parsing**: Handles @ATTRIBUTE/@attribute variations + +### 3.
**API Usability** +- ✅ **Multiple load methods**: Three different loading strategies +- ✅ **Comprehensive getters**: Good access to internal data structures +- ✅ **Utility functions**: Includes trim() and split() helpers + +### 4. **Testing Coverage** +- ✅ **Real datasets**: Tests with iris, glass, adult, and Japanese vowels datasets +- ✅ **Edge cases**: Tests different class positioning scenarios +- ✅ **Data validation**: Verifies parsing accuracy with expected values + +--- + +## 🔴 Critical Weaknesses + +### 1. **Memory Management & Performance Issues** + +#### **Inefficient Data Layout** (HIGH SEVERITY) +```cpp +// Line 131: Inefficient memory allocation +X = std::vector<std::vector<float>>(attributes.size(), std::vector<float>(lines.size())); +``` +- **Problem**: Feature-major layout instead of sample-major +- **Impact**: Poor cache locality, inefficient for ML algorithms +- **Memory overhead**: Double allocation for `X` and `Xs` vectors +- **Performance**: Suboptimal for large datasets + +#### **Redundant Memory Usage** (MEDIUM SEVERITY) +```cpp +std::vector<std::vector<float>> X; // Line 89 +std::vector<std::vector<std::string>> Xs; // Line 90 +``` +- **Problem**: Maintains both numeric and string representations +- **Impact**: 2x memory usage for categorical features +- **Memory waste**: `Xs` could be deallocated after factorization + +#### **No Memory Pre-allocation** (MEDIUM SEVERITY) +- **Problem**: Multiple vector resizing during parsing +- **Impact**: Memory fragmentation and performance degradation + +### 2.
**Error Handling & Robustness** + +#### **Unsafe Type Conversions** (HIGH SEVERITY) +```cpp +// Line 145: No exception handling +X[xIndex][i] = stof(token); +``` +- **Problem**: `stof()` can throw `std::invalid_argument` or `std::out_of_range` +- **Impact**: Program termination on malformed numeric data +- **Missing validation**: No checks for valid numeric format + +#### **Insufficient Input Validation** (HIGH SEVERITY) +```cpp +// Line 39: Unsafe comparison without bounds checking +for (int i = 0; i < attributes.size(); ++i) +``` +- **Problem**: No validation of file structure integrity +- **Missing checks**: + - Empty attribute names + - Duplicate attribute names + - Malformed attribute declarations + - Inconsistent number of tokens per line + +#### **Resource Management** (MEDIUM SEVERITY) +```cpp +// Line 163-194: No RAII for file handling +std::ifstream file(fileName); +// ... processing ... +file.close(); // Manual close +``` +- **Problem**: Manual file closing (though acceptable here) +- **Potential issue**: No exception safety guarantee + +### 3. **Algorithm & Design Issues** + +#### **Inefficient String Processing** (MEDIUM SEVERITY) +```cpp +// Line 176-182: Inefficient attribute parsing +std::stringstream ss(line); +ss >> keyword >> attribute; +type = ""; +while (ss >> type_w) + type += type_w + " "; // String concatenation in loop +``` +- **Problem**: Repeated string concatenation is O(n²) +- **Impact**: Performance degradation on large files +- **Solution needed**: Use string reserve or stringstream + +#### **Suboptimal Lookup Performance** (LOW SEVERITY) +```cpp +// Line 144: Map lookup in hot path +if (numeric_features[attributes[xIndex].first]) +``` +- **Problem**: Hash map lookup for every data point +- **Impact**: Unnecessary overhead during dataset generation + +### 4.
**API Design Limitations** + +#### **Return by Value Issues** (MEDIUM SEVERITY) +```cpp +// Line 55-60: Expensive copies +std::vector<std::string> getLines() const { return lines; } +std::map<std::string, std::vector<std::string>> getStates() const { return states; } +``` +- **Problem**: Large object copies instead of const references +- **Impact**: Unnecessary memory allocation and copying +- **Performance**: O(n) copy cost for large datasets + +#### **Non-const Correctness** (MEDIUM SEVERITY) +```cpp +// Line 68-69: Mutable references without const alternatives +std::vector<std::vector<float>>& getX() { return X; } +std::vector<int>& getY() { return y; } +``` +- **Problem**: No const versions for read-only access +- **Impact**: API design inconsistency, potential accidental modification + +#### **Type Inconsistency** (LOW SEVERITY) +```cpp +// Line 56: Mixed return types +unsigned long int getSize() const { return lines.size(); } +``` +- **Problem**: Should use `size_t` or `std::size_t` +- **Impact**: Type conversion warnings on some platforms + +### 5. **Thread Safety** + +#### **Not Thread-Safe** (MEDIUM SEVERITY) +- **Problem**: No synchronization mechanisms +- **Impact**: Unsafe for concurrent access +- **Missing**: Thread-safe accessors or documentation warning + +### 6. **Security Considerations** + +#### **Path Traversal Vulnerability** (LOW SEVERITY) +```cpp +// Line 161: No path validation +void loadCommon(std::string fileName) +``` +- **Problem**: No validation of file path +- **Impact**: Potential directory traversal if user input not sanitized +- **Mitigation**: Application-level validation needed + +#### **Resource Exhaustion** (MEDIUM SEVERITY) +- **Problem**: No limits on file size or memory usage +- **Impact**: Potential DoS with extremely large files +- **Missing**: File size validation and memory limits + +### 7.
**ARFF Format Compliance** + +#### **Limited Format Support** (MEDIUM SEVERITY) +- **Missing features**: + - Date attributes (`@attribute date "yyyy-MM-dd HH:mm:ss"`) + - String attributes (`@attribute text string`) + - Relational attributes (nested ARFF) + - Sparse data format (`{0 X, 3 Y, 5 Z}`) + +#### **Parsing Edge Cases** (LOW SEVERITY) +```cpp +// Line 188: Simplistic missing value detection +if (line.find("?", 0) != std::string::npos) +``` +- **Problem**: Doesn't handle quoted '?' characters +- **Impact**: May incorrectly skip valid data containing '?' in strings + +--- + +## 🔧 Recommended Improvements + +### High Priority +1. **Add exception handling** around `stof()` calls +2. **Implement proper input validation** for malformed data +3. **Fix memory layout** to sample-major organization +4. **Add const-correct API methods** +5. **Optimize string concatenation** in parsing + +### Medium Priority +1. **Implement RAII** patterns consistently +2. **Add memory usage limits** and validation +3. **Provide const reference getters** for large objects +4. **Document thread safety** requirements +5. **Add comprehensive error reporting** + +### Low Priority +1. **Extend ARFF format support** (dates, strings, sparse) +2. **Optimize lookup performance** with cached indices +3. **Add file path validation** +4. **Implement move semantics** for performance + +--- + +## 📊 Performance Metrics (Estimated) + +| Dataset Size | Memory Overhead | Performance Impact | +|--------------|-----------------|-------------------| +| Small (< 1MB) | ~200% | Negligible | +| Medium (10MB) | ~300% | Moderate | +| Large (100MB+) | ~400% | Significant | + +**Note**: Overhead includes duplicate storage and inefficient layout. + +--- + +## 🎯 Conclusion + +The ArffFiles library successfully implements core ARFF parsing functionality but suffers from several design and implementation issues that limit its suitability for production environments. The most critical concerns are: + +1.
**Lack of robust error handling** leading to potential crashes +2. **Inefficient memory usage** limiting scalability +3. **Performance issues** with large datasets + +While functional for small to medium datasets in controlled environments, significant refactoring would be required for production use with large datasets or untrusted input. + +**Recommendation**: Consider this library suitable for prototyping and small-scale applications, but plan for refactoring before production deployment. \ No newline at end of file diff --git a/conanfile.py b/conanfile.py new file mode 100644 index 0000000..b527122 --- /dev/null +++ b/conanfile.py @@ -0,0 +1,43 @@ +import re +from conan import ConanFile +from conan.tools.files import copy + + +class ArffFilesConan(ConanFile): + name = "arff-files" + version = "X.X.X" + description = ( + "Header-only library to read ARFF (Attribute-Relation File Format) files and return STL vectors with the data read." + ) + url = "https://github.com/rmontanana/ArffFiles" + license = "MIT" + homepage = "https://github.com/rmontanana/ArffFiles" + topics = ("arff", "data-processing", "file-parsing", "header-only", "cpp17") + no_copy_source = True + exports_sources = "ArffFiles.hpp", "LICENSE", "README.md" + package_type = "header-library" + + def init(self): + # Read the CMakeLists.txt file to get the version + with open("CMakeLists.txt", "r") as f: + lines = f.readlines() + for line in lines: + if "VERSION" in line: + # Extract the version number using regex + match = re.search(r"VERSION\s+(\d+\.\d+\.\d+)", line) + if match: + self.version = match.group(1) + + def package(self): + # Copy header file to include directory + copy(self, "*.hpp", src=self.source_folder, dst=self.package_folder, keep_path=False) + # Copy license and readme for package documentation + copy(self, "LICENSE", src=self.source_folder, dst=self.package_folder, keep_path=False) + copy(self, "README.md", src=self.source_folder, dst=self.package_folder, keep_path=False) + + def 
package_info(self): + # Header-only library configuration + self.cpp_info.bindirs = [] + self.cpp_info.libdirs = [] + # Set include directory (header will be in package root) + self.cpp_info.includedirs = ["."]