# ArffFiles



A modern C++17 header-only library to read **ARFF (Attribute-Relation File Format)** files and convert them into STL vectors for machine learning and data analysis applications.
## Features
- **Header-only**: Simply include `ArffFiles.hpp`; no separate compilation step is required
- **Modern C++17**: Clean, efficient implementation using modern C++ standards
- **Automatic Type Detection**: Distinguishes between numeric and categorical attributes
- **Flexible Class Positioning**: Support for class attributes at any position
- **STL Integration**: Returns standard `std::vector` containers for seamless integration
- **Data Cleaning**: Automatically handles missing values (lines with `?` are skipped)
- **Label Encoding**: Automatic factorization of categorical features into numeric codes
## Requirements
- **C++17** compatible compiler
- **Standard Library**: Uses STL containers (no external dependencies)
## Installation
### Using Conan
```bash
# Add the package to your conanfile.txt
[requires]
arff-files/1.0.1
# Or install directly
conan install arff-files/1.0.1@
```
### Manual Installation
Simply download `ArffFiles.hpp` and include it in your project:
```cpp
#include "ArffFiles.hpp"
```
## Quick Start
```cpp
#include "ArffFiles.hpp"
#include <iostream>

int main() {
    ArffFiles arff;

    // Load ARFF file (class attribute at the end by default)
    arff.load("dataset.arff");

    // Get feature matrix and labels
    auto& X = arff.getX(); // std::vector<std::vector<float>>
    auto& y = arff.getY(); // std::vector<int>

    std::cout << "Dataset size: " << arff.getSize() << " samples" << std::endl;
    std::cout << "Features: " << X.size() << std::endl;
    std::cout << "Classes: " << arff.getLabels().size() << std::endl;
    return 0;
}
```
## API Reference
### Loading Data
```cpp
// Load with class attribute at the end (default)
arff.load("dataset.arff");
// Load with class attribute at the beginning
arff.load("dataset.arff", false);
// Load with specific named class attribute
arff.load("dataset.arff", "class_name");
```
### Accessing Data
```cpp
// Get feature matrix (each inner vector is a feature, not a sample)
std::vector<std::vector<float>>& X = arff.getX();
// Get labels (encoded as integers)
std::vector<int>& y = arff.getY();
// Get dataset information
std::string className = arff.getClassName();
std::vector<std::string> labels = arff.getLabels();
unsigned long size = arff.getSize();
// Get attribute information
auto attributes = arff.getAttributes(); // std::vector<std::pair<std::string, std::string>>
auto numericFeatures = arff.getNumericAttributes(); // std::map<std::string, bool>
```
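
Because `X` is stored feature-major, code that expects one row per sample needs a transpose first. The helper below is a minimal sketch and not part of ArffFiles: the name `toSampleMajor` is hypothetical, and it assumes `float` elements and that every feature vector has the same length.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ArffFiles): converts a feature-major
// matrix, where X[f][s] is feature f of sample s, into a sample-major
// matrix, where rows[s][f] is the same value.
// Assumption: all inner vectors of X have the same length.
std::vector<std::vector<float>> toSampleMajor(const std::vector<std::vector<float>>& X) {
    if (X.empty()) {
        return {};
    }
    const std::size_t nFeatures = X.size();
    const std::size_t nSamples = X.front().size();
    std::vector<std::vector<float>> rows(nSamples, std::vector<float>(nFeatures));
    for (std::size_t f = 0; f < nFeatures; ++f) {
        for (std::size_t s = 0; s < nSamples; ++s) {
            // at() throws std::out_of_range if a feature vector is too short
            rows[s][f] = X[f].at(s);
        }
    }
    return rows;
}
```

With a loaded dataset this would be called as `auto rows = toSampleMajor(arff.getX());`, after which `rows.size()` should match `arff.getSize()`.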
### Utility Methods
```cpp
// Get library version
std::string version = arff.version();
// Access raw lines (after preprocessing)
std::vector<std::string> lines = arff.getLines();
// Get label states mapping
auto states = arff.getStates(); // std::map<std::string, std::vector<std::string>>
```
## Data Processing Pipeline
1. **File Parsing**: Reads ARFF format, extracts `@attribute` declarations and data
2. **Missing Value Handling**: Skips lines containing `?` (missing values)
3. **Feature Type Detection**: Automatically identifies `REAL`, `INTEGER`, `NUMERIC` vs categorical
4. **Label Positioning**: Handles class attributes at any position in the data
5. **Factorization**: Converts categorical features and labels to numeric codes
6. **Data Organization**: Creates feature matrix `X` and label vector `y`
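
Steps 2 and 5 of the pipeline can be sketched in isolation. The code below is an illustration under stated assumptions, not the library's implementation: the function names `dropMissing` and `factorize` are hypothetical, and assigning codes in order of first appearance is an assumption about the encoding order.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Step 2 (sketch): keep only the data lines that contain no '?' marker.
std::vector<std::string> dropMissing(const std::vector<std::string>& lines) {
    std::vector<std::string> kept;
    for (const auto& line : lines) {
        if (line.find('?') == std::string::npos) {
            kept.push_back(line);
        }
    }
    return kept;
}

// Step 5 (sketch): map each distinct string to an integer code.
// Assumption: codes are assigned in order of first appearance.
// Returns the encoded values and the list of labels indexed by code.
std::pair<std::vector<int>, std::vector<std::string>>
factorize(const std::vector<std::string>& values) {
    std::map<std::string, int> codeOf;
    std::vector<int> codes;
    std::vector<std::string> labels;
    codes.reserve(values.size());
    for (const auto& value : values) {
        // try_emplace inserts only if the value has not been seen before
        auto [it, inserted] = codeOf.try_emplace(value, static_cast<int>(labels.size()));
        if (inserted) {
            labels.push_back(value);
        }
        codes.push_back(it->second);
    }
    return {codes, labels};
}
```

Returning the label list alongside the codes makes it possible to translate a predicted code back to its original string.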
## Example: Complete Workflow
```cpp
#include "ArffFiles.hpp"
#include <exception>
#include <iostream>

int main() {
    try {
        ArffFiles arff;
        arff.load("iris.arff");

        // Display dataset information
        std::cout << "Dataset: " << arff.getClassName() << std::endl;
        std::cout << "Samples: " << arff.getSize() << std::endl;
        std::cout << "Features: " << arff.getX().size() << std::endl;

        // Show class labels
        auto labels = arff.getLabels();
        std::cout << "Classes: ";
        for (const auto& label : labels) {
            std::cout << label << " ";
        }
        std::cout << std::endl;

        // Show which features are numeric
        auto numericFeatures = arff.getNumericAttributes();
        for (const auto& [feature, isNumeric] : numericFeatures) {
            std::cout << feature << ": " << (isNumeric ? "numeric" : "categorical") << std::endl;
        }
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```
## Supported ARFF Features
- ✅ Numeric attributes (`@attribute feature REAL/INTEGER/NUMERIC`)
- ✅ Categorical attributes (`@attribute feature {value1,value2,...}`)
- ✅ Comments (lines starting with `%`)
- ✅ Missing values (automatic skipping of lines with `?`)
- ✅ Flexible class attribute positioning
- ✅ Case-insensitive attribute declarations
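
To make the line types above concrete, here is a rough sketch of how a header line might be classified. It is not the library's parser: `LineKind`, `toUpper`, and `classifyLine` are hypothetical names, and the sketch assumes attribute names contain no spaces or quotes.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

enum class LineKind { Comment, NumericAttribute, CategoricalAttribute, Other };

// Upper-cases a copy of the string so keywords can be compared
// case-insensitively.
std::string toUpper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

// Hypothetical classifier for a single ARFF line.
// Assumption: attribute names have no embedded whitespace.
LineKind classifyLine(const std::string& line) {
    if (!line.empty() && line.front() == '%') {
        return LineKind::Comment;
    }
    std::istringstream in(line);
    std::string keyword, name;
    in >> keyword >> name;
    if (toUpper(keyword) != "@ATTRIBUTE" || name.empty()) {
        return LineKind::Other;
    }
    in >> std::ws;  // skip whitespace before the type specification
    if (in.peek() == '{') {
        return LineKind::CategoricalAttribute;  // {value1,value2,...}
    }
    std::string type;
    in >> type;
    type = toUpper(type);
    if (type == "REAL" || type == "INTEGER" || type == "NUMERIC") {
        return LineKind::NumericAttribute;
    }
    return LineKind::Other;
}
```

For example, `classifyLine("@ATTRIBUTE width numeric")` yields `LineKind::NumericAttribute` regardless of the keyword's letter case.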
## Error Handling
The library throws `std::invalid_argument` exceptions for:
- Unable to open file
- No attributes found in file
- Specified class name not found
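
When a failed load should not abort the program, the exception can be turned into a value. The wrapper below is a sketch of one way to do that; `tryLoad` is a hypothetical helper, not part of ArffFiles, and it accepts any callable so it can be exercised without the library.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ArffFiles): calls load(path) and returns
// the exception message if std::invalid_argument is thrown, or std::nullopt
// when the call succeeds. Other exception types propagate to the caller.
template <typename Loader>
std::optional<std::string> tryLoad(Loader&& load, const std::string& path) {
    try {
        load(path);
        return std::nullopt;
    } catch (const std::invalid_argument& e) {
        return std::string(e.what());
    }
}
```

With the library, a call might look like `tryLoad([&arff](const std::string& p) { arff.load(p); }, "dataset.arff")`, where a returned message signals one of the three failure cases listed above.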
## Development
### Building and Testing
```bash
# Build and run tests
make build && make test
# Run tests with verbose output
make test opt="-s"
# Clean test artifacts
make clean
```
### Using CMake Directly
```bash
mkdir build_debug
cmake -S . -B build_debug -D CMAKE_BUILD_TYPE=Debug -D ENABLE_TESTING=ON
cmake --build build_debug -t unit_tests_arffFiles
cd build_debug/tests && ./unit_tests_arffFiles
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.