ArffFiles

A modern C++17 header-only library to read ARFF (Attribute-Relation File Format) files and convert them into STL vectors for machine learning and data analysis applications.

Features

  • 🔧 Header-only: Simply include ArffFiles.hpp - no compilation required
  • 🚀 Modern C++17: Clean, efficient implementation using modern C++ standards
  • 🔄 Automatic Type Detection: Distinguishes between numeric and categorical attributes
  • 📊 Flexible Class Positioning: Support for class attributes at any position
  • 🎯 STL Integration: Returns standard std::vector containers for seamless integration
  • 🧹 Data Cleaning: Automatically handles missing values (lines with '?' are skipped)
  • 🏷️ Label Encoding: Automatic factorization of categorical features into numeric codes

Requirements

  • C++17 compatible compiler
  • Standard Library: Uses STL containers (no external dependencies)

Installation

Using Conan

# Add the package to your conanfile.txt
[requires]
arff-files/1.2.1

# Or install directly
conan install arff-files/1.2.1@

Manual Installation

Simply download ArffFiles.hpp and include it in your project:

#include "ArffFiles.hpp"

Quick Start

#include "ArffFiles.hpp"
#include <iostream>

int main() {
    ArffFiles arff;
    
    // Load ARFF file (class attribute at the end by default)
    arff.load("dataset.arff");
    
    // Get feature matrix and labels
    auto& X = arff.getX();  // std::vector<std::vector<float>>
    auto& y = arff.getY();  // std::vector<int>
    
    std::cout << "Dataset size: " << arff.getSize() << " samples" << std::endl;
    std::cout << "Features: " << X.size() << std::endl;
    std::cout << "Classes: " << arff.getLabels().size() << std::endl;
    
    return 0;
}

API Reference

Loading Data

// Load with class attribute at the end (default)
arff.load("dataset.arff");

// Load with class attribute at the beginning
arff.load("dataset.arff", false);

// Load with specific named class attribute
arff.load("dataset.arff", "class_name");

Accessing Data

// Get feature matrix (each inner vector is a feature, not a sample)
std::vector<std::vector<float>>& X = arff.getX();

// Get labels (encoded as integers)
std::vector<int>& y = arff.getY();

// Get dataset information
std::string className = arff.getClassName();
std::vector<std::string> labels = arff.getLabels();
unsigned long size = arff.getSize();

// Get attribute information
auto attributes = arff.getAttributes();  // std::vector<std::pair<std::string, std::string>>
auto numericFeatures = arff.getNumericAttributes();  // std::map<std::string, bool>
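
Because getX() is feature-major (X[feature][sample]), code that expects one row per sample needs a transpose. The following is a minimal sketch of that conversion; toSampleMajor is a hypothetical helper written for this README, not a function provided by ArffFiles.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of ArffFiles): converts a feature-major
// matrix, X[feature][sample], into a sample-major one, rows[sample][feature].
std::vector<std::vector<float>> toSampleMajor(const std::vector<std::vector<float>>& X) {
    if (X.empty()) {
        return {};
    }
    const std::size_t nSamples = X.front().size();
    std::vector<std::vector<float>> rows(nSamples, std::vector<float>(X.size()));
    for (std::size_t f = 0; f < X.size(); ++f) {
        // Every feature column must describe the same number of samples.
        if (X[f].size() != nSamples) {
            throw std::invalid_argument("all features must have the same number of samples");
        }
        for (std::size_t s = 0; s < nSamples; ++s) {
            rows[s][f] = X[f][s];
        }
    }
    return rows;
}
```

With a loaded dataset this could be called as toSampleMajor(arff.getX()); the result copies the data, so the original matrix is left untouched.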

Utility Methods

// Get library version
std::string version = arff.version();

// Access raw lines (after preprocessing)
std::vector<std::string> lines = arff.getLines();

// Get label states mapping
auto states = arff.getStates();  // std::map<std::string, std::vector<std::string>>

Data Processing Pipeline

  1. File Parsing: Reads ARFF format, extracts @attribute declarations and data
  2. Missing Value Handling: Skips lines containing ? (missing values)
  3. Feature Type Detection: Classifies each attribute as numeric (REAL, INTEGER, NUMERIC) or categorical
  4. Label Positioning: Handles class attributes at any position in the data
  5. Factorization: Converts categorical features and labels to numeric codes
  6. Data Organization: Creates feature matrix X and label vector y
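
To illustrate the factorization step, here is a minimal sketch of label encoding. It is not the library's own code: factorize is a hypothetical helper, and assigning codes in order of first appearance is an assumption made for this example (the ordering ArffFiles uses is not stated here).

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not the library's implementation: assigns each
// distinct string an integer code in order of first appearance.
// Returns {codes, labels}, where labels[code] is the original string.
std::pair<std::vector<int>, std::vector<std::string>>
factorize(const std::vector<std::string>& values) {
    std::map<std::string, int> codeOf;
    std::vector<int> codes;
    std::vector<std::string> labels;
    codes.reserve(values.size());
    for (const auto& value : values) {
        // try_emplace only inserts when the value has not been seen before.
        auto [it, inserted] = codeOf.try_emplace(value, static_cast<int>(labels.size()));
        if (inserted) {
            labels.push_back(value);
        }
        codes.push_back(it->second);
    }
    return {codes, labels};
}
```

Keeping the labels vector alongside the codes makes it possible to map an encoded value back to its original string.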

Example: Complete Workflow

#include "ArffFiles.hpp"
#include <exception>
#include <iostream>

int main() {
    try {
        ArffFiles arff;
        arff.load("iris.arff");
        
        // Display dataset information
        std::cout << "Class attribute: " << arff.getClassName() << std::endl;
        std::cout << "Samples: " << arff.getSize() << std::endl;
        std::cout << "Features: " << arff.getX().size() << std::endl;
        
        // Show class labels
        auto labels = arff.getLabels();
        std::cout << "Classes: ";
        for (const auto& label : labels) {
            std::cout << label << " ";
        }
        std::cout << std::endl;
        
        // Show which features are numeric
        auto numericFeatures = arff.getNumericAttributes();
        for (const auto& [feature, isNumeric] : numericFeatures) {
            std::cout << feature << ": " << (isNumeric ? "numeric" : "categorical") << std::endl;
        }
        
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    
    return 0;
}

Supported ARFF Features

  • Numeric attributes (@attribute feature REAL/INTEGER/NUMERIC)
  • Categorical attributes (@attribute feature {value1,value2,...})
  • Comments (lines starting with %)
  • Missing values (automatic skipping of lines with ?)
  • Flexible class attribute positioning
  • Case-insensitive attribute declarations
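
The rules in this list can be sketched as a few small predicates. This is an illustration under stated assumptions, not the parser used by ArffFiles: trim, toUpper, isSkippable and isNumericType are hypothetical names, and isSkippable is meant for rows in the data section only.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helpers for illustration; these names are not part of ArffFiles.

// Strips leading and trailing whitespace.
std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t\r\n");
    if (first == std::string::npos) {
        return "";
    }
    const auto last = s.find_last_not_of(" \t\r\n");
    return s.substr(first, last - first + 1);
}

// Upper-cases a copy of the string so keywords compare case-insensitively.
std::string toUpper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

// True for data-section lines that carry no usable row: blank lines,
// % comments, and rows containing the missing-value marker '?'.
bool isSkippable(const std::string& line) {
    const std::string t = trim(line);
    return t.empty() || t.front() == '%' || t.find('?') != std::string::npos;
}

// True when an attribute's declared type is REAL, INTEGER or NUMERIC,
// regardless of letter case.
bool isNumericType(const std::string& type) {
    const std::string t = toUpper(trim(type));
    return t == "REAL" || t == "INTEGER" || t == "NUMERIC";
}
```

For example, isNumericType("real") is true, while a nominal declaration such as {yes,no} is not treated as numeric.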

Error Handling

The library throws std::invalid_argument in the following cases:

  • Unable to open file
  • No attributes found in file
  • Specified class name not found
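
When an exception should not propagate, the load call can be wrapped so that the failure comes back as a value. The sketch below is one possible pattern; tryLoad is a hypothetical helper, not part of the library, and it takes the loading step as a callable so it does not depend on ArffFiles itself.

```cpp
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: runs a loading step and returns the error message
// if it throws std::invalid_argument, or std::nullopt on success.
std::optional<std::string> tryLoad(const std::function<void()>& loadStep) {
    try {
        loadStep();
        return std::nullopt;
    } catch (const std::invalid_argument& e) {
        return std::string(e.what());
    }
}
```

A call such as tryLoad([&] { arff.load("dataset.arff"); }) would then yield an empty optional on success and the exception text otherwise. Other exception types are deliberately not caught here.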

Development

Building and Testing

# Build and run tests
make build && make test

# Run tests with verbose output
make test opt="-s"

# Clean test artifacts
make clean

Using CMake Directly

mkdir build_debug
cmake -S . -B build_debug -D CMAKE_BUILD_TYPE=Debug -D ENABLE_TESTING=ON
cmake --build build_debug -t unit_tests_arffFiles
cd build_debug/tests && ./unit_tests_arffFiles

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
