mlpack::data Namespace Reference

Functions to load and save matrices and models. More...

Classes

class  BagOfWordsEncodingPolicy
 Definition of the BagOfWordsEncodingPolicy class. More...

 
class  CharExtract
 The class is used to split a string into characters. More...

 
class  CustomImputation
 A simple custom imputation class. More...

 
class  DatasetMapper
 Auxiliary information for a dataset, including mappings to/from strings (or other types) and the datatype of each dimension. More...

 
class  DictionaryEncodingPolicy
 DictionaryEncodingPolicy is used as a helper class for StringEncoding. More...

 
struct  HasSerialize
 
struct  HasSerializeFunction
 
class  ImageInfo
 Implements meta-data of images required by data::Load and data::Save for loading and saving images into arma::Mat. More...

 
class  Imputer
 Given a dataset of a particular datatype, replace user-specified missing values with a value that depends on the StrategyType and MapperType. More...

 
class  IncrementPolicy
 IncrementPolicy is used as a helper class for DatasetMapper. More...

 
class  ListwiseDeletion
 A complete-case analysis to remove the values containing mappedValue. More...

 
class  LoadCSV
 Loads a CSV file. This class uses boost::spirit to implement the parser; see http://theboostcpplibraries.com/boost.spirit for a quick review. More...

 
class  MaxAbsScaler
 A simple MaxAbs Scaler class. More...

 
class  MeanImputation
 A simple mean imputation class. More...

 
class  MeanNormalization
 A simple Mean Normalization class. More...

 
class  MedianImputation
 This is a class implementation of simple median imputation. More...

 
class  MinMaxScaler
 A simple MinMax Scaler class. More...

 
class  MissingPolicy
 MissingPolicy is used as a helper class for DatasetMapper. More...

 
class  PCAWhitening
 A simple PCAWhitening class. More...

 
class  ScalingModel
 The model to save to disk. More...

 
class  SplitByAnyOf
 The SplitByAnyOf class tokenizes a string using a set of delimiters. More...

 
class  StandardScaler
 A simple Standard Scaler class. More...

 
class  StringEncoding
 The class translates a set of strings into numbers using various encoding algorithms. More...

 
class  StringEncodingDictionary
 This class provides a dictionary interface for the purpose of string encoding. More...

 
class  StringEncodingDictionary< boost::string_view >
 
class  StringEncodingDictionary< int >
 
struct  StringEncodingPolicyTraits
 This is a template struct that provides some information about various encoding policies. More...

 
struct  StringEncodingPolicyTraits< DictionaryEncodingPolicy >
 The specialization provides some information about the dictionary encoding policy. More...

 
class  TfIdfEncodingPolicy
 Definition of the TfIdfEncodingPolicy class. More...

 
class  ZCAWhitening
 A simple ZCAWhitening class. More...

 

Typedefs

template<typename TokenType>
using BagOfWordsEncoding = StringEncoding< BagOfWordsEncodingPolicy, StringEncodingDictionary< TokenType > >
 A convenient alias for the StringEncoding class with BagOfWordsEncodingPolicy and the default dictionary for the given token type. More...

 
using DatasetInfo = DatasetMapper< data::IncrementPolicy >
 
template<typename TokenType>
using DictionaryEncoding = StringEncoding< DictionaryEncodingPolicy, StringEncodingDictionary< TokenType > >
 A convenient alias for the StringEncoding class with DictionaryEncodingPolicy and the default dictionary for the given token type. More...

 
template<typename TokenType>
using TfIdfEncoding = StringEncoding< TfIdfEncodingPolicy, StringEncodingDictionary< TokenType > >
 A convenient alias for the StringEncoding class with TfIdfEncodingPolicy and the default dictionary for the given token type. More...

 

Enumerations

enum  Datatype : bool { numeric = 0, categorical = 1 }
 The Datatype enum specifies the types of data mlpack algorithms can use. More...

 
enum  format { autodetect, json, xml, binary }
 Define the formats we can read through cereal. More...

 

Functions

arma::file_type AutoDetect (std::fstream &stream, const std::string &filename)
 Attempt to auto-detect the type of a file given its extension, and by inspecting the parts of the file to disambiguate between types when necessary. More...

 
template<typename T>
void Binarize (const arma::Mat< T > &input, arma::Mat< T > &output, const double threshold)
 Given an input dataset and threshold, set values greater than threshold to 1 and values less than or equal to the threshold to 0. More...

 
template<typename T>
void Binarize (const arma::Mat< T > &input, arma::Mat< T > &output, const double threshold, const size_t dimension)
 Given an input dataset and threshold, set values greater than threshold to 1 and values less than or equal to the threshold to 0. More...

 
template<typename eT>
void ConfusionMatrix (const arma::Row< size_t > predictors, const arma::Row< size_t > responses, arma::Mat< eT > &output, const size_t numClasses)
 A confusion matrix is a summary of prediction results on a classification problem. More...

 
arma::file_type DetectFromExtension (const std::string &filename)
 Return the type based only on the extension. More...

 
std::string Extension (const std::string &filename)
 
std::string GetStringType (const arma::file_type &type)
 Given a file type, return a logical name corresponding to that file type. More...

 
arma::file_type GuessFileType (std::istream &f)
 Given an istream, attempt to guess the file type. More...

 
 HAS_EXACT_METHOD_FORM (serialize, HasSerializeCheck)
 
bool ImageFormatSupported (const std::string &fileName, const bool save=false)
 Checks if the given image filename is supported. More...

 
template<typename T>
bool IsNaNInf (T &val, const std::string &token)
 See if the token is a NaN or an Inf, and if so, set the value accordingly and return a boolean representing whether or not it is. More...

 
template<typename eT>
bool Load (const std::string &filename, arma::Mat< eT > &matrix, const bool fatal=false, const bool transpose=true, const arma::file_type inputLoadType=arma::auto_detect)
 Loads a matrix from file, guessing the filetype from the extension. More...

 
template<typename eT>
bool Load (const std::string &filename, arma::SpMat< eT > &matrix, const bool fatal=false, const bool transpose=true)
 Loads a sparse matrix from file, using arma::coord_ascii format. More...

 
template<typename eT>
bool Load (const std::string &filename, arma::Col< eT > &vec, const bool fatal=false)
 Load a column vector from a file, guessing the filetype from the extension. More...

 
template<typename eT>
bool Load (const std::string &filename, arma::Row< eT > &rowvec, const bool fatal=false)
 Load a row vector from a file, guessing the filetype from the extension. More...

 
template<typename eT , typename PolicyType>
bool Load (const std::string &filename, arma::Mat< eT > &matrix, DatasetMapper< PolicyType > &info, const bool fatal=false, const bool transpose=true)
 Loads a matrix from a file, guessing the filetype from the extension and mapping categorical features with a DatasetMapper object. More...

 
template<typename T>
bool Load (const std::string &filename, const std::string &name, T &t, const bool fatal=false, format f=format::autodetect)
 Load a model from a file, guessing the filetype from the extension, or, optionally, loading the specified format. More...

 
template<typename eT>
bool Load (const std::string &filename, arma::Mat< eT > &matrix, ImageInfo &info, const bool fatal=false)
 Load the image file into the given matrix. More...

 
template<typename eT>
bool Load (const std::vector< std::string > &files, arma::Mat< eT > &matrix, ImageInfo &info, const bool fatal=false)
 Load the image files into the given matrix. More...

 
template<typename eT>
void LoadARFF (const std::string &filename, arma::Mat< eT > &matrix)
 A utility function to load an ARFF dataset as numeric features (that is, as an Armadillo matrix without any modification). More...

 
template<typename eT , typename PolicyType>
void LoadARFF (const std::string &filename, arma::Mat< eT > &matrix, DatasetMapper< PolicyType > &info)
 A utility function to load an ARFF dataset as numeric and categorical features, using the DatasetInfo structure for mapping. More...

 
bool LoadImage (const std::string &filename, arma::Mat< unsigned char > &matrix, ImageInfo &info, const bool fatal=false)
 
template<typename eT , typename RowType>
void NormalizeLabels (const RowType &labelsIn, arma::Row< size_t > &labels, arma::Col< eT > &mapping)
 Given a set of labels of a particular datatype, convert them to unsigned labels in the range [0, n) where n is the number of different labels. More...

 
template<typename RowType , typename MatType>
void OneHotEncoding (const RowType &labelsIn, MatType &output)
 Given a set of labels of a particular datatype, convert them to binary vector. More...

 
template<typename eT>
void OneHotEncoding (const arma::Mat< eT > &input, const arma::Col< size_t > &indices, arma::Mat< eT > &output)
 Overloaded function for the above function, which takes a matrix as input and also a vector of indices to encode and outputs a matrix. More...

 
template<typename eT>
void OneHotEncoding (const arma::Mat< eT > &input, arma::Mat< eT > &output, const data::DatasetInfo &datasetInfo)
 Overloaded function for the above function, which takes a matrix as input and also a DatasetInfo object and outputs a matrix. More...

 
template<typename eT>
void RevertLabels (const arma::Row< size_t > &labels, const arma::Col< eT > &mapping, arma::Row< eT > &labelsOut)
 Given a set of labels that have been mapped to the range [0, n), map them back to the original labels given by the 'mapping' vector. More...

 
template<typename eT>
bool Save (const std::string &filename, const arma::Mat< eT > &matrix, const bool fatal=false, bool transpose=true, arma::file_type inputSaveType=arma::auto_detect)
 Saves a matrix to file, guessing the filetype from the extension. More...

 
template<typename eT>
bool Save (const std::string &filename, const arma::SpMat< eT > &matrix, const bool fatal=false, bool transpose=true)
 Saves a sparse matrix to file, guessing the filetype from the extension. More...

 
template<typename T>
bool Save (const std::string &filename, const std::string &name, T &t, const bool fatal=false, format f=format::autodetect)
 Saves a model to file, guessing the filetype from the extension, or, optionally, saving the specified format. More...

 
template<typename eT>
bool Save (const std::string &filename, arma::Mat< eT > &matrix, ImageInfo &info, const bool fatal=false)
 Save the image file from the given matrix. More...

 
template<typename eT>
bool Save (const std::vector< std::string > &files, arma::Mat< eT > &matrix, ImageInfo &info, const bool fatal=false)
 Save the image file from the given matrix. More...

 
bool SaveImage (const std::string &filename, arma::Mat< unsigned char > &image, ImageInfo &info, const bool fatal=false)
 Helper function to save files. More...

 
template<typename T , typename LabelsType , typename = std::enable_if_t<arma::is_arma_type<LabelsType>::value>>
void Split (const arma::Mat< T > &input, const LabelsType &inputLabel, arma::Mat< T > &trainData, arma::Mat< T > &testData, LabelsType &trainLabel, LabelsType &testLabel, const double testRatio, const bool shuffleData=true)
 Given an input dataset and labels, split into a training set and test set. More...

 
template<typename T>
void Split (const arma::Mat< T > &input, arma::Mat< T > &trainData, arma::Mat< T > &testData, const double testRatio, const bool shuffleData=true)
 Given an input dataset, split into a training set and test set. More...

 
template<typename T , typename LabelsType , typename = std::enable_if_t<arma::is_arma_type<LabelsType>::value>>
std::tuple< arma::Mat< T >, arma::Mat< T >, LabelsType, LabelsType > Split (const arma::Mat< T > &input, const LabelsType &inputLabel, const double testRatio, const bool shuffleData=true, const bool stratifyData=false)
 Given an input dataset and labels, split into a training set and test set. More...

 
template<typename T>
std::tuple< arma::Mat< T >, arma::Mat< T > > Split (const arma::Mat< T > &input, const double testRatio, const bool shuffleData=true)
 Given an input dataset, split into a training set and test set. More...

 
template<typename FieldType , typename T , typename = std::enable_if_t< arma::is_Col<typename FieldType::object_type>::value || arma::is_Mat_only<typename FieldType::object_type>::value>>
void Split (const FieldType &input, const arma::field< T > &inputLabel, FieldType &trainData, arma::field< T > &trainLabel, FieldType &testData, arma::field< T > &testLabel, const double testRatio, const bool shuffleData=true)
 Given an input dataset and labels, split into a training set and test set. More...

 
template<class FieldType , class = std::enable_if_t< arma::is_Col<typename FieldType::object_type>::value || arma::is_Mat_only<typename FieldType::object_type>::value>>
void Split (const FieldType &input, FieldType &trainData, FieldType &testData, const double testRatio, const bool shuffleData=true)
 Given an input dataset, split into a training set and test set. More...

 
template<class FieldType , typename T , class = std::enable_if_t< arma::is_Col<typename FieldType::object_type>::value || arma::is_Mat_only<typename FieldType::object_type>::value>>
std::tuple< FieldType, FieldType, arma::field< T >, arma::field< T > > Split (const FieldType &input, const arma::field< T > &inputLabel, const double testRatio, const bool shuffleData=true)
 Given an input dataset and labels, split into a training set and test set. More...

 
template<class FieldType , class = std::enable_if_t< arma::is_Col<typename FieldType::object_type>::value || arma::is_Mat_only<typename FieldType::object_type>::value>>
std::tuple< FieldType, FieldType > Split (const FieldType &input, const double testRatio, const bool shuffleData=true)
 Given an input dataset, split into a training set and test set. More...

 
template<typename InputType>
void SplitHelper (const InputType &input, InputType &train, InputType &test, const double testRatio, const arma::uvec &order=arma::uvec())
 This helper function splits any input data into training and testing parts. More...

 
template<typename T , typename LabelsType , typename = std::enable_if_t<arma::is_arma_type<LabelsType>::value>>
void StratifiedSplit (const arma::Mat< T > &input, const LabelsType &inputLabel, arma::Mat< T > &trainData, arma::Mat< T > &testData, LabelsType &trainLabel, LabelsType &testLabel, const double testRatio, const bool shuffleData=true)
 Given an input dataset and labels, stratify into a training set and test set. More...

 

Detailed Description

Functions to load and save matrices and models.

Typedef Documentation

◆ BagOfWordsEncoding

A convenient alias for the StringEncoding class with BagOfWordsEncodingPolicy and the default dictionary for the given token type.

Template Parameters
  TokenType  Type of the tokens.

Definition at line 167 of file bag_of_words_encoding_policy.hpp.

◆ DatasetInfo

typedef DatasetMapper< IncrementPolicy, std::string > DatasetInfo

Definition at line 196 of file dataset_mapper.hpp.

◆ DictionaryEncoding

A convenient alias for the StringEncoding class with DictionaryEncodingPolicy and the default dictionary for the given token type.

Template Parameters
  TokenType  Type of the tokens.

Definition at line 146 of file dictionary_encoding_policy.hpp.

◆ TfIdfEncoding

A convenient alias for the StringEncoding class with TfIdfEncodingPolicy and the default dictionary for the given token type.

Template Parameters
  TokenType  Type of the tokens.

Definition at line 345 of file tf_idf_encoding_policy.hpp.

Enumeration Type Documentation

◆ Datatype

enum Datatype : bool

The Datatype enum specifies the types of data mlpack algorithms can use.

The vast majority of mlpack algorithms can only use numeric data (i.e. float/double/etc.), but some algorithms can use categorical data, specified via this Datatype enum and the DatasetMapper class.

Enumerator
numeric 
categorical 

Definition at line 24 of file datatype.hpp.

◆ format

enum format

Define the formats we can read through cereal.

Enumerator
autodetect 
json 
xml 
binary 

Definition at line 20 of file format.hpp.

Function Documentation

◆ AutoDetect()

arma::file_type mlpack::data::AutoDetect ( std::fstream &  stream,
const std::string &  filename 
)

Attempt to auto-detect the type of a file given its extension, and by inspecting the parts of the file to disambiguate between types when necessary.

(For instance, a .csv file could be delimited by spaces, commas, or tabs.) This is meant to be used during loading.

If the file is detected as a CSV, and the CSV is detected to have a header row, stream will be fast-forwarded to point at the second line of the file.

Parameters
  stream    Opened file stream to look into for autodetection.
  filename  Name of the file.
Returns
  The detected file type, or arma::file_type_unknown if unknown.

◆ Binarize() [1/2]

void mlpack::data::Binarize ( const arma::Mat< T > &  input,
arma::Mat< T > &  output,
const double  threshold 
)

Given an input dataset and threshold, set values greater than threshold to 1 and values less than or equal to the threshold to 0.

This overload applies the changes to all dimensions.

arma::Mat<double> input = loadData();
arma::Mat<double> output;
double threshold = 0.5;
// Binarize the whole matrix. All values greater than 0.5 will be set to 1
// and the values less than or equal to 0.5 will become 0.
Binarize<double>(input, output, threshold);
Parameters
  input      Input matrix to binarize.
  output     Matrix to store the binarized data in.
  threshold  Threshold to compare against; can be any number.

Definition at line 41 of file binarize.hpp.

References omp_size_t.

◆ Binarize() [2/2]

void mlpack::data::Binarize ( const arma::Mat< T > &  input,
arma::Mat< T > &  output,
const double  threshold,
const size_t  dimension 
)

Given an input dataset and threshold, set values greater than threshold to 1 and values less than or equal to the threshold to 0.

This overload takes a dimension and applies the changes only to that dimension.

arma::Mat<double> input = loadData();
arma::Mat<double> output;
double threshold = 0.5;
size_t dimension = 0;
// Binarize the first dimension. All values greater than 0.5 in the first
// dimension will be set to 1 and the values less than or equal to 0.5 will
// become 0.
Binarize<double>(input, output, threshold, dimension);
Parameters
  input      Input matrix to binarize.
  output     Matrix to store the binarized data in.
  threshold  Threshold to compare against; can be any number.
  dimension  Feature (dimension) to apply the Binarize function to.

Definition at line 77 of file binarize.hpp.

References omp_size_t.

◆ ConfusionMatrix()

void mlpack::data::ConfusionMatrix ( const arma::Row< size_t >  predictors,
const arma::Row< size_t >  responses,
arma::Mat< eT > &  output,
const size_t  numClasses 
)

A confusion matrix is a summary of prediction results on a classification problem.

The numbers of correct and incorrect predictions are summarized by count and broken down by class. For example, for 2 classes, the function call will be

ConfusionMatrix(predictors, responses, output, 2)

In this case, the output matrix will be of size 2 x 2:

      0   1
  0  TP  FN
  1  FP  TN

The confusion matrix for two labels will look like what is shown above. In this confusion matrix, TP represents the number of true positives, FP represents the number of false positives, FN represents the number of false negatives, and TN represents the number of true negatives.

When generalizing to more than two classes, the row index of the confusion matrix represents the predicted class and the column index represents the actual class.

Parameters
  predictors  Vector of data points.
  responses   The measured data for each point.
  output      Matrix which is represented as confusion matrix.
  numClasses  Number of classes.
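
The counting described above can be sketched without mlpack or Armadillo. The helper below is a hypothetical standalone function (its name and its use of std::vector are assumptions made for illustration, not mlpack's implementation); it follows the convention stated above that rows index the predicted class and columns index the actual class.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Standalone sketch, not mlpack's implementation: count how often each
// (predicted, actual) pair of class labels occurs.
// counts[p][a] is the number of points predicted as class p whose actual
// class is a, so rows are predicted classes and columns are actual classes.
std::vector<std::vector<std::size_t>> CountConfusions(
    const std::vector<std::size_t>& predicted,
    const std::vector<std::size_t>& actual,
    const std::size_t numClasses)
{
  std::vector<std::vector<std::size_t>> counts(
      numClasses, std::vector<std::size_t>(numClasses, 0));

  // Only compare as many points as both vectors provide.
  const std::size_t n = std::min(predicted.size(), actual.size());
  for (std::size_t i = 0; i < n; ++i)
  {
    // Skip labels outside [0, numClasses) instead of writing out of bounds.
    if (predicted[i] < numClasses && actual[i] < numClasses)
      ++counts[predicted[i]][actual[i]];
  }
  return counts;
}
```

The diagonal entries count correct predictions; every off-diagonal entry counts one specific kind of mistake.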

◆ DetectFromExtension()

arma::file_type mlpack::data::DetectFromExtension ( const std::string &  filename)

Return the type based only on the extension.

Parameters
  filename  Name of the file whose type we should detect.
Returns
  Detected type of file, or arma::file_type_unknown if unknown.

◆ Extension()

std::string mlpack::data::Extension ( const std::string &  filename)
inline

Definition at line 21 of file extension.hpp.
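
Extension() is listed here without a description. As a rough illustration of what extension-based detection relies on, the sketch below extracts a lowercase extension from a filename. The helper name LowercaseExtension and its exact behavior (lowercasing, empty string when there is no dot) are assumptions for this example, not a statement of what mlpack::data::Extension() does.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: return the text after the last '.' in lowercase, or
// an empty string if the filename contains no '.'.
// Limitation of this sketch: a dot inside a directory name (for example
// "dir.d/file") is not treated specially.
std::string LowercaseExtension(const std::string& filename)
{
  const std::size_t dot = filename.rfind('.');
  if (dot == std::string::npos)
    return "";

  std::string ext = filename.substr(dot + 1);
  std::transform(ext.begin(), ext.end(), ext.begin(),
      [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return ext;
}
```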

◆ GetStringType()

std::string mlpack::data::GetStringType ( const arma::file_type &  type)

Given a file type, return a logical name corresponding to that file type.

Parameters
  type  Type to get the logical name of.

◆ GuessFileType()

arma::file_type mlpack::data::GuessFileType ( std::istream &  f)

Given an istream, attempt to guess the file type.

This is taken originally from Armadillo's function guess_file_type_internal(), but we avoid using internal Armadillo functionality.

If the file is detected as a CSV, and the CSV is detected to have a header row, the stream f will be fast-forwarded to point at the second line of the file.

Parameters
  f  Opened istream to look into to guess the file type.

◆ HAS_EXACT_METHOD_FORM()

mlpack::data::HAS_EXACT_METHOD_FORM ( serialize  ,
HasSerializeCheck   
)

◆ ImageFormatSupported()

bool mlpack::data::ImageFormatSupported ( const std::string &  fileName,
const bool  save = false 
)
inline

Checks if the given image filename is supported.

Parameters
  fileName  Name of the image file.
  save      Set to true to check if the file format can be saved, else loaded.
Returns
  Boolean value that is true if the image format is supported.

◆ IsNaNInf()

bool mlpack::data::IsNaNInf ( T &  val,
const std::string &  token 
)
inline

See if the token is a NaN or an Inf, and if so, set the value accordingly and return a boolean representing whether or not it is.

Definition at line 27 of file is_naninf.hpp.
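
A function with this contract can be sketched as follows. The accepted spellings ("nan", "inf", "+inf", "-inf", case-insensitive), the helper name ParseNaNInf, and the restriction to double are assumptions made for this example; the set of tokens mlpack actually recognises is not stated above.

```cpp
#include <algorithm>
#include <cctype>
#include <limits>
#include <string>

// Hypothetical sketch: if `token` spells NaN or Inf, store the matching
// floating-point value in `val` and return true; otherwise leave `val`
// untouched and return false.
bool ParseNaNInf(double& val, const std::string& token)
{
  // Compare case-insensitively by upper-casing a copy of the token.
  std::string upper(token);
  std::transform(upper.begin(), upper.end(), upper.begin(),
      [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

  if (upper == "INF" || upper == "+INF")
  {
    val = std::numeric_limits<double>::infinity();
    return true;
  }
  if (upper == "-INF")
  {
    val = -std::numeric_limits<double>::infinity();
    return true;
  }
  if (upper == "NAN")
  {
    val = std::numeric_limits<double>::quiet_NaN();
    return true;
  }
  return false;
}
```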

◆ Load() [1/8]

bool mlpack::data::Load ( const std::string &  filename,
arma::Mat< eT > &  matrix,
const bool  fatal = false,
const bool  transpose = true,
const arma::file_type  inputLoadType = arma::auto_detect 
)

Loads a matrix from file, guessing the filetype from the extension.

This will transpose the matrix at load time (unless the transpose parameter is set to false).

The supported types of files are the same as found in Armadillo:

  • CSV (arma::csv_ascii), denoted by .csv, or optionally .txt
  • TSV (arma::raw_ascii), denoted by .tsv, .csv, or .txt
  • ASCII (arma::raw_ascii), denoted by .txt
  • Armadillo ASCII (arma::arma_ascii), also denoted by .txt
  • PGM (arma::pgm_binary), denoted by .pgm
  • PPM (arma::ppm_binary), denoted by .ppm
  • Raw binary (arma::raw_binary), denoted by .bin
  • Armadillo binary (arma::arma_binary), denoted by .bin
  • HDF5 (arma::hdf5_binary), denoted by .hdf, .hdf5, .h5, or .he5

By default, this function will try to automatically determine the type of file to load based on its extension and by inspecting the file. If you know the file type and want to specify it manually, override the default inputLoadType parameter with the correct type from the list above (e.g. arma::csv_ascii).

If the detected file type is CSV (arma::csv_ascii), the first row will be checked for a CSV header. If a CSV header is not detected, the first row will be treated as data; otherwise, the first row will be skipped.

If the parameter 'fatal' is set to true, a std::runtime_error exception will be thrown if the matrix does not load successfully. The parameter 'transpose' controls whether or not the matrix is transposed after loading. In most cases, because data is generally stored in a row-major format and mlpack requires column-major matrices, this should be left at its default value of 'true'.

Parameters
  filename       Name of file to load.
  matrix         Matrix to load contents of file into.
  fatal          If an error should be reported as fatal (default false).
  transpose      If true, transpose the matrix after loading (default true).
  inputLoadType  Used to determine the type of file to load (default arma::auto_detect).
Returns
  Boolean value indicating success or failure of load.

Referenced by mlpack::bindings::cli::GetParam(), and LoadBostonHousingDataset().
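
To make the effect of the transpose parameter concrete, here is a standalone sketch that does not use mlpack or Armadillo. It parses comma-separated text with one data point per line and stores it column-major with one data point per column, which is the layout the description above says mlpack expects. The ColMajorMatrix type and LoadTransposedCsv function are hypothetical names for this example, and the sketch assumes every line has the same number of fields and no header row.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Minimal dense matrix stored column-major (hypothetical type for this
// sketch). Element (r, c) lives at values[c * rows + r].
struct ColMajorMatrix
{
  std::size_t rows = 0;
  std::size_t cols = 0;
  std::vector<double> values;

  double At(const std::size_t r, const std::size_t c) const
  {
    return values[c * rows + r];
  }
};

// Parse comma-separated text holding one data point per line and return it
// transposed, so that each data point becomes one column.
// Assumes all non-empty lines have the same number of numeric fields.
ColMajorMatrix LoadTransposedCsv(const std::string& text)
{
  ColMajorMatrix m;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line))
  {
    if (line.empty())
      continue;

    std::istringstream fields(line);
    std::string field;
    std::size_t dims = 0;
    while (std::getline(fields, field, ','))
    {
      // Appending a whole line contiguously makes it one column.
      m.values.push_back(std::stod(field));
      ++dims;
    }

    if (m.cols == 0)
      m.rows = dims;  // The first line fixes the dimensionality.
    ++m.cols;
  }
  return m;
}
```

In this layout a file with N lines of D values yields a D x N matrix, so each point's values sit next to each other in memory.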

◆ Load() [2/8]

bool mlpack::data::Load ( const std::string &  filename,
arma::SpMat< eT > &  matrix,
const bool  fatal = false,
const bool  transpose = true 
)

Loads a sparse matrix from file, using arma::coord_ascii format.

This will transpose the matrix at load time (unless the transpose parameter is set to false). If the filetype cannot be determined, an error will be given.

The supported types of files are the same as found in Armadillo:

  • TSV (coord_ascii), denoted by .tsv or .txt
  • TXT (coord_ascii), denoted by .txt
  • Raw binary (raw_binary), denoted by .bin
  • Armadillo binary (arma_binary), denoted by .bin

If the file extension is not one of those types, an error will be given. This is preferable to Armadillo's default behavior of loading an unknown filetype as raw_binary, which can have very confusing effects.

If the parameter 'fatal' is set to true, a std::runtime_error exception will be thrown if the matrix does not load successfully. The parameter 'transpose' controls whether or not the matrix is transposed after loading. In most cases, because data is generally stored in a row-major format and mlpack requires column-major matrices, this should be left at its default value of 'true'.

Parameters
  filename   Name of file to load.
  matrix     Sparse matrix to load contents of file into.
  fatal      If an error should be reported as fatal (default false).
  transpose  If true, transpose the matrix after loading (default true).
Returns
  Boolean value indicating success or failure of load.

◆ Load() [3/8]

bool mlpack::data::Load ( const std::string &  filename,
arma::Col< eT > &  vec,
const bool  fatal = false 
)

Load a column vector from a file, guessing the filetype from the extension.

The supported types of files are the same as found in Armadillo:

  • CSV (csv_ascii), denoted by .csv, or optionally .txt
  • TSV (raw_ascii), denoted by .tsv, .csv, or .txt
  • ASCII (raw_ascii), denoted by .txt
  • Armadillo ASCII (arma_ascii), also denoted by .txt
  • PGM (pgm_binary), denoted by .pgm
  • PPM (ppm_binary), denoted by .ppm
  • Raw binary (raw_binary), denoted by .bin
  • Armadillo binary (arma_binary), denoted by .bin
  • HDF5, denoted by .hdf, .hdf5, .h5, or .he5

If the file extension is not one of those types, an error will be given. This is preferable to Armadillo's default behavior of loading an unknown filetype as raw_binary, which can have very confusing effects.

If the parameter 'fatal' is set to true, a std::runtime_error exception will be thrown if the matrix does not load successfully.

Parameters
  filename  Name of file to load.
  vec       Column vector to load contents of file into.
  fatal     If an error should be reported as fatal (default false).
Returns
  Boolean value indicating success or failure of load.

◆ Load() [4/8]

bool mlpack::data::Load ( const std::string &  filename,
arma::Row< eT > &  rowvec,
const bool  fatal = false 
)

Load a row vector from a file, guessing the filetype from the extension.

The supported types of files are the same as found in Armadillo:

  • CSV (csv_ascii), denoted by .csv, or optionally .txt
  • TSV (raw_ascii), denoted by .tsv, .csv, or .txt
  • ASCII (raw_ascii), denoted by .txt
  • Armadillo ASCII (arma_ascii), also denoted by .txt
  • PGM (pgm_binary), denoted by .pgm
  • PPM (ppm_binary), denoted by .ppm
  • Raw binary (raw_binary), denoted by .bin
  • Armadillo binary (arma_binary), denoted by .bin
  • HDF5, denoted by .hdf, .hdf5, .h5, or .he5

If the file extension is not one of those types, an error will be given. This is preferable to Armadillo's default behavior of loading an unknown filetype as raw_binary, which can have very confusing effects.

If the parameter 'fatal' is set to true, a std::runtime_error exception will be thrown if the matrix does not load successfully.

Parameters
  filename  Name of file to load.
  rowvec    Row vector to load contents of file into.
  fatal     If an error should be reported as fatal (default false).
Returns
  Boolean value indicating success or failure of load.

◆ Load() [5/8]

bool mlpack::data::Load ( const std::string &  filename,
arma::Mat< eT > &  matrix,
DatasetMapper< PolicyType > &  info,
const bool  fatal = false,
const bool  transpose = true 
)

Loads a matrix from a file, guessing the filetype from the extension and mapping categorical features with a DatasetMapper object.

This will transpose the matrix (unless the transpose parameter is set to false). This particular overload of Load() can only load text-based formats, such as those given below:

  • CSV (csv_ascii), denoted by .csv, or optionally .txt
  • TSV (raw_ascii), denoted by .tsv, .csv, or .txt
  • ASCII (raw_ascii), denoted by .txt

If the file extension is not one of those types, an error will be given. This is preferable to Armadillo's default behavior of loading an unknown filetype as raw_binary, which can have very confusing effects.

If the parameter 'fatal' is set to true, a std::runtime_error exception will be thrown if the matrix does not load successfully. The parameter 'transpose' controls whether or not the matrix is transposed after loading. In most cases, because data is generally stored in a row-major format and mlpack requires column-major matrices, this should be left at its default value of 'true'.

If the given info has already been used with a different data::Load() call where the dataset has the same dimensionality, then the mappings and dimension types inside of info will be re-used. If the given info is a new DatasetMapper object (e.g. its dimensionality is 0), then new mappings will be created. If the given info has a different dimensionality of data than what is present in filename, an exception will be thrown.

Parameters
  filename   Name of file to load.
  matrix     Matrix to load contents of file into.
  info       DatasetMapper object to populate with mappings and data types.
  fatal      If an error should be reported as fatal (default false).
  transpose  If true, transpose the matrix after loading.
Returns
  Boolean value indicating success or failure of load.
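
Why re-using the same info object matters can be shown with a small standalone sketch. CategoryMap below is a hypothetical stand-in for the mapping of a single categorical dimension; assigning indices in order of first appearance is an assumption made for this example, and none of the names are mlpack API.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the string <-> index mapping of one categorical
// dimension. Indices are assigned in order of first appearance.
class CategoryMap
{
 public:
  // Return the index for `value`, assigning the next free index the first
  // time the value is seen.
  std::size_t Map(const std::string& value)
  {
    const auto it = forward.find(value);
    if (it != forward.end())
      return it->second;

    const std::size_t index = reverse.size();
    forward.emplace(value, index);
    reverse.push_back(value);
    return index;
  }

  // Recover the original string for a previously assigned index.
  // Throws std::out_of_range for an index that was never assigned.
  const std::string& Unmap(const std::size_t index) const
  {
    return reverse.at(index);
  }

  // Number of distinct values seen so far.
  std::size_t NumMappings() const { return reverse.size(); }

 private:
  std::unordered_map<std::string, std::size_t> forward;
  std::vector<std::string> reverse;
};
```

If a training file maps "red" to 0 and "blue" to 1, passing the same object when reading a test file keeps "blue" at 1. A fresh object that happens to see "blue" first would map it to 0, and the two matrices would no longer agree.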

◆ Load() [6/8]

bool mlpack::data::Load ( const std::string &  filename,
const std::string &  name,
T &  t,
const bool  fatal = false,
format  f = format::autodetect 
)

Load a model from a file, guessing the filetype from the extension, or, optionally, loading the specified format. If automatic extension detection is used and the filetype cannot be determined, an error will be given.

The supported types of files are the same as what is supported by the cereal library:

  • json, denoted by .json
  • xml, denoted by .xml
  • binary, denoted by .bin

The format parameter can take any of the values in the 'format' enum: 'format::autodetect', 'format::json', 'format::xml', and 'format::binary'. The autodetect functionality operates on the file extension (so, "file.json" would be autodetected as JSON).

The name parameter should be specified to indicate the name of the structure to be loaded. This should be the same as the name that was used to save the structure (otherwise, the loading procedure will fail).

If the parameter 'fatal' is set to true, then an exception will be thrown in the event of load failure. Otherwise, the method will return false and the relevant error information will be printed to Log::Warn.
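
The extension-based autodetection described above amounts to a small lookup. The sketch below is standalone and illustrative only: ModelFormat, its extra `unknown` value, and DetectModelFormat are hypothetical names for this example, and the case-sensitive match is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical mirror of the three formats listed above. `unknown` is added
// for this sketch to represent an extension that cannot be classified.
enum class ModelFormat { json, xml, binary, unknown };

// Map a filename to a format using only its extension
// (.json, .xml, or .bin; the comparison is case-sensitive in this sketch).
ModelFormat DetectModelFormat(const std::string& filename)
{
  const std::size_t dot = filename.rfind('.');
  if (dot == std::string::npos)
    return ModelFormat::unknown;

  const std::string ext = filename.substr(dot + 1);
  if (ext == "json")
    return ModelFormat::json;
  if (ext == "xml")
    return ModelFormat::xml;
  if (ext == "bin")
    return ModelFormat::binary;
  return ModelFormat::unknown;
}
```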

◆ Load() [7/8]

bool mlpack::data::Load ( const std::string &  filename,
arma::Mat< eT > &  matrix,
ImageInfo &  info,
const bool  fatal = false 
)

Load the image file into the given matrix.

Parameters
  filename  Name of the image file.
  matrix    Matrix to load the image into.
  info      An object of ImageInfo class.
  fatal     If an error should be reported as fatal (default false).
Returns
  Boolean value indicating success or failure of load.

◆ Load() [8/8]

bool mlpack::data::Load ( const std::vector< std::string > &  files,
arma::Mat< eT > &  matrix,
ImageInfo info,
const bool  fatal = false 
)

Load the image files into the given matrix.

Parameters
files  A vector consisting of filenames.
matrix  Matrix to load the images into.
info  An object of ImageInfo class.
fatal  If an error should be reported as fatal (default false).
Returns
Boolean value indicating success or failure of load.

◆ LoadARFF() [1/2]

void mlpack::data::LoadARFF ( const std::string &  filename,
arma::Mat< eT > &  matrix 
)

A utility function to load an ARFF dataset as numeric features (that is, as an Armadillo matrix without any modification).

An exception will be thrown if any features are non-numeric.

◆ LoadARFF() [2/2]

void mlpack::data::LoadARFF ( const std::string &  filename,
arma::Mat< eT > &  matrix,
DatasetMapper< PolicyType > &  info 
)

A utility function to load an ARFF dataset as numeric and categorical features, using the DatasetInfo structure for mapping.

An exception will be thrown upon failure.

A pre-existing DatasetInfo object can be passed in, but if the dimensionality of the given DatasetInfo object (info.Dimensionality()) does not match the dimensionality of the data, a std::invalid_argument exception will be thrown. If an empty DatasetInfo object is given (constructed with the default constructor or otherwise, so that info.Dimensionality() is 0), it will be set to the right dimensionality.

The ability to pass in a pre-existing DatasetInfo object is necessary when, e.g., loading a test set after training. If the DatasetInfo obtained while loading the training set is not reused, the test set may be loaded with different mappings, so the same category could be assigned different numeric values in the two sets.

Parameters
filename  Name of ARFF file to load.
matrix  Matrix to load data into.
info  DatasetInfo object; can be default-constructed or pre-existing from another call to LoadARFF().
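
Why the training-set mappings should be reused can be seen in a simplified sketch. `CategoryMapper` is a hypothetical stand-in for the per-dimension string mapping held by a DatasetInfo object; assigning values in first-seen order is an assumption made for illustration, not a statement about mlpack's internals.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-in for the string-to-number mapping that a
// DatasetInfo object keeps for one categorical dimension.
class CategoryMapper
{
 public:
  // Return the numeric value for a category, assigning the next unused value
  // the first time the category is seen.
  std::size_t Map(const std::string& category)
  {
    const auto it = mappings.find(category);
    if (it != mappings.end())
      return it->second;

    const std::size_t value = mappings.size();
    mappings.emplace(category, value);
    return value;
  }

 private:
  std::unordered_map<std::string, std::size_t> mappings;
};
```

If the training set maps "red" to 0 and "blue" to 1, a test set that happens to list "blue" first still gets 1 for "blue" when the same mapper is reused, whereas a fresh mapper would assign it 0.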

◆ LoadImage()

bool mlpack::data::LoadImage ( const std::string &  filename,
arma::Mat< unsigned char > &  matrix,
ImageInfo info,
const bool  fatal = false 
)

◆ NormalizeLabels()

void mlpack::data::NormalizeLabels ( const RowType &  labelsIn,
arma::Row< size_t > &  labels,
arma::Col< eT > &  mapping 
)

Given a set of labels of a particular datatype, convert them to unsigned labels in the range [0, n) where n is the number of different labels.

Also, a reverse mapping from the new label to the old value is stored in the 'mapping' vector.

Parameters
labelsIn  Input labels of arbitrary datatype.
labels  Vector that unsigned labels will be stored in.
mapping  Reverse mapping to convert new labels back to old labels.
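
The normalization can be sketched on plain std::vector containers. `NormalizeLabelsSketch()` is a hypothetical helper, not the mlpack function, and numbering the labels in order of first appearance is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Sketch of label normalization on std::vector containers.
// Each distinct input label receives the next unused index, in order of first
// appearance; mapping[i] holds the original label for normalized label i.
inline void NormalizeLabelsSketch(const std::vector<int>& labelsIn,
                                  std::vector<std::size_t>& labels,
                                  std::vector<int>& mapping)
{
  labels.clear();
  mapping.clear();
  for (const int original : labelsIn)
  {
    // Linear search for a label that has already been assigned an index.
    std::size_t index = 0;
    while (index < mapping.size() && mapping[index] != original)
      ++index;

    // First time this label is seen: append it to the reverse mapping.
    if (index == mapping.size())
      mapping.push_back(original);

    labels.push_back(index);
  }
}
```

For the input labels 3, 7, 3, -1, 7 this yields the normalized labels 0, 1, 0, 2, 1 and the reverse mapping 3, 7, -1.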

◆ OneHotEncoding() [1/3]

void mlpack::data::OneHotEncoding ( const RowType &  labelsIn,
MatType &  output 
)

Given a set of labels of a particular datatype, convert them to binary vectors.

The categorical values are mapped to integer values. Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.

Parameters
labelsIn  Input labels of arbitrary datatype.
output  Binary matrix.
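
A standalone sketch of the encoding, using nested std::vector in place of a matrix. `OneHotSketch()` is a hypothetical helper; the sorted category order and the layout (one row per category, one column per label) are assumptions of this sketch rather than documented behavior.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of one-hot encoding: output[c][i] is 1 exactly when label i belongs
// to category c, and 0 otherwise.
inline std::vector<std::vector<int>> OneHotSketch(
    const std::vector<int>& labelsIn)
{
  // Collect the distinct categories in sorted order.
  std::vector<int> categories(labelsIn);
  std::sort(categories.begin(), categories.end());
  categories.erase(std::unique(categories.begin(), categories.end()),
                   categories.end());

  // Start from an all-zero matrix: one row per category, one column per label.
  std::vector<std::vector<int>> output(
      categories.size(), std::vector<int>(labelsIn.size(), 0));

  for (std::size_t i = 0; i < labelsIn.size(); ++i)
  {
    // The position of the label among the sorted categories is its integer
    // value; mark that row with a 1 in column i.
    const auto it = std::lower_bound(categories.begin(), categories.end(),
                                     labelsIn[i]);
    output[static_cast<std::size_t>(it - categories.begin())][i] = 1;
  }
  return output;
}
```

For the labels 2, 5, 2, 9 the categories are 2, 5, 9, so the rows of the result are 1 0 1 0, then 0 1 0 0, then 0 0 0 1.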

◆ OneHotEncoding() [2/3]

void mlpack::data::OneHotEncoding ( const arma::Mat< eT > &  input,
const arma::Col< size_t > &  indices,
arma::Mat< eT > &  output 
)

Overload of the above function that takes a matrix as input, along with a vector of indices to encode, and outputs a matrix.

Indices represent the IDs of the dimensions to be one-hot encoded.

Parameters
input  Input dataset to be encoded.
indices  Index of rows to be encoded.
output  Encoded matrix.

◆ OneHotEncoding() [3/3]

void mlpack::data::OneHotEncoding ( const arma::Mat< eT > &  input,
arma::Mat< eT > &  output,
const data::DatasetInfo datasetInfo 
)

Overload of the above function that takes a matrix as input, along with a DatasetInfo object, and outputs a matrix.

This function encodes all the dimensions marked Datatype::categorical in the data::DatasetInfo.

Parameters
input  Input dataset to be encoded.
output  Encoded matrix.
datasetInfo  DatasetInfo object that has information about data.

◆ RevertLabels()

void mlpack::data::RevertLabels ( const arma::Row< size_t > &  labels,
const arma::Col< eT > &  mapping,
arma::Row< eT > &  labelsOut 
)

Given a set of labels that have been mapped to the range [0, n), map them back to the original labels given by the 'mapping' vector.

Parameters
labels  Set of normalized labels to convert.
mapping  Mapping to use to convert labels.
labelsOut  Vector to store new labels in.
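
The reverse step is a lookup into the 'mapping' vector. A minimal sketch on std::vector follows; `RevertLabelsSketch()` is a hypothetical helper, not the mlpack function.

```cpp
#include <cstddef>
#include <vector>

// Sketch of reverting normalized labels: each label in [0, n) is replaced by
// the original value stored at that position of the mapping.
inline std::vector<int> RevertLabelsSketch(
    const std::vector<std::size_t>& labels,
    const std::vector<int>& mapping)
{
  std::vector<int> labelsOut;
  labelsOut.reserve(labels.size());
  for (const std::size_t label : labels)
    labelsOut.push_back(mapping.at(label)); // at() throws if label >= n.
  return labelsOut;
}
```

With the mapping 3, 7, -1, the normalized labels 0, 1, 0, 2, 1 are converted back to 3, 7, 3, -1, 7.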

◆ Save() [1/5]

bool mlpack::data::Save ( const std::string &  filename,
const arma::Mat< eT > &  matrix,
const bool  fatal = false,
bool  transpose = true,
arma::file_type  inputSaveType = arma::auto_detect 
)

Saves a matrix to file, guessing the filetype from the extension.

This will transpose the matrix at save time. If the filetype cannot be determined, an error will be given.

The supported types of files are the same as found in Armadillo:

  • CSV (arma::csv_ascii), denoted by .csv, or optionally .txt
  • ASCII (arma::raw_ascii), denoted by .txt
  • Armadillo ASCII (arma::arma_ascii), also denoted by .txt
  • PGM (arma::pgm_binary), denoted by .pgm
  • PPM (arma::ppm_binary), denoted by .ppm
  • Raw binary (arma::raw_binary), denoted by .bin
  • Armadillo binary (arma::arma_binary), denoted by .bin
  • HDF5 (arma::hdf5_binary), denoted by .hdf5, .hdf, .h5, or .he5

By default, this function will try to automatically determine the format to save with based only on the filename's extension. To specify a file type manually, override the default inputSaveType parameter with one of the types above (e.g., arma::csv_ascii).

If the 'fatal' parameter is set to true, a std::runtime_error exception will be thrown upon failure. If the 'transpose' parameter is set to true, the matrix will be transposed before saving. Generally, because mlpack stores matrices in a column-major format and most datasets are stored on disk as row-major, this parameter should be left at its default value of 'true'.

Parameters
filename  Name of file to save to.
matrix  Matrix to save into file.
fatal  If an error should be reported as fatal (default false).
transpose  If true, transpose the matrix before saving (default true).
inputSaveType  File type to save to (defaults to arma::auto_detect).
Returns
Boolean value indicating success or failure of save.
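
The effect of the default 'transpose' setting can be illustrated without Armadillo. In the sketch below each inner vector stands for one column of a column-major matrix, that is, one data point; `ToCsvTransposed()` is a hypothetical helper and the CSV layout is only an illustration of the row-major on-disk convention described above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Sketch: write a matrix stored as a list of columns (one point per column)
// as CSV text with one point per line, i.e. the transposed layout.
inline std::string ToCsvTransposed(
    const std::vector<std::vector<double>>& columns)
{
  std::ostringstream out;
  for (const std::vector<double>& point : columns)
  {
    for (std::size_t i = 0; i < point.size(); ++i)
    {
      if (i > 0)
        out << ',';
      out << point[i];
    }
    out << '\n';
  }
  return out.str();
}
```

Two three-dimensional points stored as the columns (1, 2, 3) and (4, 5, 6) are written as the two lines "1,2,3" and "4,5,6".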

◆ Save() [2/5]

bool mlpack::data::Save ( const std::string &  filename,
const arma::SpMat< eT > &  matrix,
const bool  fatal = false,
bool  transpose = true 
)

Saves a sparse matrix to file, guessing the filetype from the extension.

This will transpose the matrix at save time. If the filetype cannot be determined, an error will be given.

The supported types of files are the same as found in Armadillo:

  • TSV (coord_ascii), denoted by .tsv or .txt
  • TXT (coord_ascii), denoted by .txt
  • Raw binary (raw_binary), denoted by .bin
  • Armadillo binary (arma_binary), denoted by .bin

If the file extension is not one of those types, an error will be given. If the 'fatal' parameter is set to true, a std::runtime_error exception will be thrown upon failure. If the 'transpose' parameter is set to true, the matrix will be transposed before saving. Generally, because mlpack stores matrices in a column-major format and most datasets are stored on disk as row-major, this parameter should be left at its default value of 'true'.

Parameters
filename  Name of file to save to.
matrix  Sparse matrix to save into file.
fatal  If an error should be reported as fatal (default false).
transpose  If true, transpose the matrix before saving (default true).
Returns
Boolean value indicating success or failure of save.

◆ Save() [3/5]

bool mlpack::data::Save ( const std::string &  filename,
const std::string &  name,
T &  t,
const bool  fatal = false,
format  f = format::autodetect 
)

Saves a model to file, guessing the filetype from the extension, or, optionally, saving the specified format.

If automatic extension detection is used and the filetype cannot be determined, an error will be given.

The supported types of files are the same as what is supported by the cereal library:

  • json, denoted by .json
  • xml, denoted by .xml
  • binary, denoted by .bin

The format parameter can take any of the values in the 'format' enum: 'format::autodetect', 'format::json', 'format::xml', and 'format::binary'. The autodetect functionality operates on the file extension (so, "file.json" would be autodetected as JSON).

The name parameter should be specified to indicate the name of the structure to be saved. If Load() is later called on the generated file, the name used to load should be the same as the name used for this call to Save().

If the parameter 'fatal' is set to true, then an exception will be thrown in the event of a save failure. Otherwise, the method will return false and the relevant error information will be printed to Log::Warn.

◆ Save() [4/5]

bool mlpack::data::Save ( const std::string &  filename,
arma::Mat< eT > &  matrix,
ImageInfo info,
const bool  fatal = false 
)

Save the image file from the given matrix.

Parameters
filename  Name of the image file.
matrix  Matrix to save the image from.
info  An object of ImageInfo class.
fatal  If an error should be reported as fatal (default false).
Returns
Boolean value indicating success or failure of save.

◆ Save() [5/5]

bool mlpack::data::Save ( const std::vector< std::string > &  files,
arma::Mat< eT > &  matrix,
ImageInfo info,
const bool  fatal = false 
)

Save the image files from the given matrix.

Parameters
files  A vector consisting of filenames.
matrix  Matrix to save the images from.
info  An object of ImageInfo class.
fatal  If an error should be reported as fatal (default false).
Returns
Boolean value indicating success or failure of save.

◆ SaveImage()

bool mlpack::data::SaveImage ( const std::string &  filename,
arma::Mat< unsigned char > &  image,
ImageInfo info,
const bool  fatal = false 
)

Helper function to save files.

Implementation in save_image.cpp.

◆ Split() [1/8]

void mlpack::data::Split ( const arma::Mat< T > &  input,
const LabelsType &  inputLabel,
arma::Mat< T > &  trainData,
arma::Mat< T > &  testData,
LabelsType &  trainLabel,
LabelsType &  testLabel,
const double  testRatio,
const bool  shuffleData = true 
)

Given an input dataset and labels, split into a training set and test set.

Example usage below. This overload places the split dataset into the four output parameters given (trainData, testData, trainLabel, and testLabel).

arma::mat input = loadData();
arma::Row<size_t> label = loadLabel();
arma::mat trainData;
arma::mat testData;
arma::Row<size_t> trainLabel;
arma::Row<size_t> testLabel;
math::RandomSeed(100); // Set the seed if you like.
// Split the dataset into a training and test set, with 30% of the data being
// held out for the test set.
Split(input, label, trainData,
testData, trainLabel, testLabel, 0.3);
Template Parameters
T  Type of the elements of the input matrix.
LabelsType  Type of input labels. It can be arma::Mat, arma::Row, arma::Cube or arma::SpMat.
Parameters
input  Input dataset to split.
inputLabel  Input labels to split.
trainData  Matrix to store training data into.
testData  Matrix to store test data into.
trainLabel  Vector to store training labels into.
testLabel  Vector to store test labels into.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).
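
The key property of a labeled split is that data and labels are reordered with the same permutation, so every sample keeps its label. The standalone sketch below shows this on std::vector; `SplitWithLabelsSketch()`, its `seed` parameter, the truncating test-set size, and the choice to fill the training set first are all assumptions of the sketch, not mlpack's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Sketch of a labeled train/test split. input and inputLabel are assumed to
// have the same length; element i of each belongs to the same sample.
inline void SplitWithLabelsSketch(const std::vector<double>& input,
                                  const std::vector<std::size_t>& inputLabel,
                                  std::vector<double>& trainData,
                                  std::vector<double>& testData,
                                  std::vector<std::size_t>& trainLabel,
                                  std::vector<std::size_t>& testLabel,
                                  const double testRatio,
                                  const bool shuffleData = true,
                                  const unsigned seed = 100)
{
  // One shared visiting order for both data and labels.
  std::vector<std::size_t> order(input.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  if (shuffleData)
  {
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
  }

  // Test-set size is truncated toward zero (an assumption of this sketch).
  const std::size_t testSize =
      static_cast<std::size_t>(input.size() * testRatio);
  const std::size_t trainSize = input.size() - testSize;

  trainData.clear();
  testData.clear();
  trainLabel.clear();
  testLabel.clear();
  for (std::size_t i = 0; i < order.size(); ++i)
  {
    const std::size_t source = order[i];
    if (i < trainSize)
    {
      trainData.push_back(input[source]);
      trainLabel.push_back(inputLabel[source]);
    }
    else
    {
      testData.push_back(input[source]);
      testLabel.push_back(inputLabel[source]);
    }
  }
}
```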

Definition at line 256 of file split_data.hpp.

References mlpack::util::CheckSameSizes(), and SplitHelper().

Referenced by LoadBostonHousingDataset(), and Split().

◆ Split() [2/8]

void mlpack::data::Split ( const arma::Mat< T > &  input,
arma::Mat< T > &  trainData,
arma::Mat< T > &  testData,
const double  testRatio,
const bool  shuffleData = true 
)

Given an input dataset, split into a training set and test set.

Example usage below. This overload places the split dataset into the two output parameters given (trainData, testData).

arma::mat input = loadData();
arma::mat trainData;
arma::mat testData;
math::RandomSeed(100); // Set the seed if you like.
// Split the dataset into a training and test set, with 30% of the data being
// held out for the test set.
Split(input, trainData, testData, 0.3);
Parameters
input  Input dataset to split.
trainData  Matrix to store training data into.
testData  Matrix to store test data into.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).

Definition at line 304 of file split_data.hpp.

References SplitHelper().

◆ Split() [3/8]

std::tuple<arma::Mat<T>, arma::Mat<T>, LabelsType, LabelsType> mlpack::data::Split ( const arma::Mat< T > &  input,
const LabelsType &  inputLabel,
const double  testRatio,
const bool  shuffleData = true,
const bool  stratifyData = false 
)

Given an input dataset and labels, split into a training set and test set.

Example usage below. This overload returns the split dataset as a std::tuple with four elements: an arma::Mat<T> containing the training data, an arma::Mat<T> containing the test data, a LabelsType containing the training labels, and a LabelsType containing the test labels.

arma::mat input = loadData();
arma::Row<size_t> label = loadLabel();
auto splitResult = Split(input, label, 0.2);
Template Parameters
T  Type of the elements of the input matrix.
LabelsType  Type of input labels. It can be arma::Mat, arma::Row, arma::Cube or arma::SpMat.
Parameters
input  Input dataset to split.
inputLabel  Input labels to split.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).
stratifyData  If true, the train and test splits are stratified so that the ratio of each class in the training and test sets is the same as in the original dataset. Expects labels to be of type arma::Row<> or arma::Col<>.
Returns
std::tuple containing trainData (arma::Mat<T>), testData (arma::Mat<T>), trainLabel (LabelsType), and testLabel (LabelsType).
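
The elements of a returned tuple are accessed with std::get or unpacked with std::tie. The sketch below uses a hypothetical `SplitSketch()` function with the same return shape (training data, test data, training labels, test labels) on std::vector types; it does no shuffling and is not the mlpack function.

```cpp
#include <cstddef>
#include <tuple>
#include <vector>

using Data = std::vector<double>;
using Labels = std::vector<std::size_t>;

// Hypothetical stand-in that returns its result in the same element order as
// the tuple described above. input and inputLabel are assumed to have the
// same length; the last part of the input becomes the test set.
inline std::tuple<Data, Data, Labels, Labels> SplitSketch(
    const Data& input, const Labels& inputLabel, const double testRatio)
{
  const std::size_t testSize =
      static_cast<std::size_t>(input.size() * testRatio);
  const std::ptrdiff_t cut =
      static_cast<std::ptrdiff_t>(input.size() - testSize);

  return std::make_tuple(
      Data(input.begin(), input.begin() + cut),
      Data(input.begin() + cut, input.end()),
      Labels(inputLabel.begin(), inputLabel.begin() + cut),
      Labels(inputLabel.begin() + cut, inputLabel.end()));
}
```

Given `auto splitResult = SplitSketch(input, label, 0.25);`, the training data is `std::get<0>(splitResult)` and the test labels are `std::get<3>(splitResult)`. Alternatively, `std::tie(trainData, testData, trainLabel, testLabel) = SplitSketch(input, label, 0.25);` assigns all four pieces to existing variables.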

Definition at line 353 of file split_data.hpp.

References Split(), and StratifiedSplit().

◆ Split() [4/8]

std::tuple<arma::Mat<T>, arma::Mat<T> > mlpack::data::Split ( const arma::Mat< T > &  input,
const double  testRatio,
const bool  shuffleData = true 
)

Given an input dataset, split into a training set and test set.

Example usage below. This overload returns the split dataset as a std::tuple with two elements: an arma::Mat<T> containing the training data and an arma::Mat<T> containing the test data.

arma::mat input = loadData();
auto splitResult = Split(input, 0.2);
Parameters
input  Input dataset to split.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).
Returns
std::tuple containing trainData (arma::Mat<T>) and testData (arma::Mat<T>).

Definition at line 401 of file split_data.hpp.

References Split().

◆ Split() [5/8]

void mlpack::data::Split ( const FieldType &  input,
const arma::field< T > &  inputLabel,
FieldType &  trainData,
arma::field< T > &  trainLabel,
FieldType &  testData,
arma::field< T > &  testLabel,
const double  testRatio,
const bool  shuffleData = true 
)

Given an input dataset and labels, split into a training set and test set.

Example usage below. This overload places the split dataset into the four output parameters given (trainData, trainLabel, testData, and testLabel).

The input dataset must be of type arma::field. It should have the shape (n_rows = 1, n_cols = number of samples, n_slices = 1).

NOTE: Here FieldType could be arma::field<arma::mat> or arma::field<arma::vec>.

arma::field<arma::mat> input = loadData();
arma::field<arma::vec> label = loadLabel();
arma::field<arma::mat> trainData;
arma::field<arma::mat> testData;
arma::field<arma::vec> trainLabel;
arma::field<arma::vec> testLabel;
math::RandomSeed(100); // Set the seed if you like.
// Split the dataset into a training and test set, with 30% of the data being
// held out for the test set.
Split(input, label, trainData, trainLabel, testData, testLabel, 0.3);
Parameters
input  Input dataset to split.
inputLabel  Input labels to split.
trainData  FieldType to store training data into.
testData  FieldType to store test data into.
trainLabel  Field vector to store training labels into.
testLabel  Field vector to store test labels into.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).

Definition at line 451 of file split_data.hpp.

References mlpack::util::CheckSameSizes(), and SplitHelper().

◆ Split() [6/8]

void mlpack::data::Split ( const FieldType &  input,
FieldType &  trainData,
FieldType &  testData,
const double  testRatio,
const bool  shuffleData = true 
)

Given an input dataset, split into a training set and test set.

Example usage below. This overload places the split dataset into the two output parameters given (trainData, testData).

The input dataset must be of type arma::field. It should have the shape (n_rows = 1, n_cols = number of samples, n_slices = 1).

NOTE: Here FieldType could be arma::field<arma::mat> or arma::field<arma::vec>.

arma::field<arma::mat> input = loadData();
arma::field<arma::mat> trainData;
arma::field<arma::mat> testData;
math::RandomSeed(100); // Set the seed if you like.
// Split the dataset into a training and test set, with 30% of the data being
// held out for the test set.
Split(input, trainData, testData, 0.3);
Parameters
input  Input dataset to split.
trainData  FieldType to store training data into.
testData  FieldType to store test data into.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).

Definition at line 507 of file split_data.hpp.

References SplitHelper().

◆ Split() [7/8]

std::tuple<FieldType, FieldType, arma::field<T>, arma::field<T> > mlpack::data::Split ( const FieldType &  input,
const arma::field< T > &  inputLabel,
const double  testRatio,
const bool  shuffleData = true 
)

Given an input dataset and labels, split into a training set and test set.

Example usage below. This overload returns the split dataset as a std::tuple with four elements: a FieldType containing the training data, a FieldType containing the test data, an arma::field<arma::vec> containing the training labels, and an arma::field<arma::vec> containing the test labels.

The input dataset must be of type arma::field. It should have the shape (n_rows = 1, n_cols = number of samples, n_slices = 1).

NOTE: Here FieldType could be arma::field<arma::mat> or arma::field<arma::vec>.

arma::field<arma::mat> input = loadData();
arma::field<arma::vec> label = loadLabel();
auto splitResult = Split(input, label, 0.2);
Parameters
input  Input dataset to split.
inputLabel  Input labels to split.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).
Returns
std::tuple containing trainData (FieldType), testData (FieldType), trainLabel (arma::field<arma::vec>), and testLabel (arma::field<arma::vec>).

Definition at line 557 of file split_data.hpp.

References Split().

◆ Split() [8/8]

std::tuple<FieldType, FieldType> mlpack::data::Split ( const FieldType &  input,
const double  testRatio,
const bool  shuffleData = true 
)

Given an input dataset, split into a training set and test set.

Example usage below. This overload returns the split dataset as a std::tuple with two elements: a FieldType containing the training data and a FieldType containing the test data.

The input dataset must be of type arma::field. It should have the shape (n_rows = 1, n_cols = number of samples, n_slices = 1).

NOTE: Here FieldType could be arma::field<arma::mat> or arma::field<arma::vec>.

arma::field<arma::mat> input = loadData();
auto splitResult = Split(input, 0.2);
Parameters
input  Input dataset to split.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).
Returns
std::tuple containing trainData (FieldType) and testData (FieldType).

Definition at line 604 of file split_data.hpp.

References Split().

◆ SplitHelper()

void mlpack::data::SplitHelper ( const InputType &  input,
InputType &  train,
InputType &  test,
const double  testRatio,
const arma::uvec &  order = arma::uvec() 
)

This helper function splits any input data into training and testing parts.

To shuffle the input data before splitting, pass an array of shuffled indices of the input data as the order argument.
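
How an index array drives the split can be sketched on std::vector. `SplitHelperSketch()` is a hypothetical stand-in, not the mlpack function; treating an empty order as linear order, truncating the test-set size, and filling the training part first are assumptions of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Sketch of an order-driven split. Position i of the output sequence takes
// element order[i] of the input; an empty order means linear order.
template<typename T>
void SplitHelperSketch(
    const std::vector<T>& input,
    std::vector<T>& train,
    std::vector<T>& test,
    const double testRatio,
    const std::vector<std::size_t>& order = std::vector<std::size_t>())
{
  const std::size_t testSize =
      static_cast<std::size_t>(input.size() * testRatio);
  const std::size_t trainSize = input.size() - testSize;

  train.clear();
  test.clear();
  for (std::size_t i = 0; i < input.size(); ++i)
  {
    const std::size_t source = order.empty() ? i : order[i];
    // The first trainSize visited elements form the training part.
    if (i < trainSize)
      train.push_back(input[source]);
    else
      test.push_back(input[source]);
  }
}
```

With the input 10, 20, 30, 40, a test ratio of 0.5, and the order 3, 0, 2, 1, the training part is 40, 10 and the test part is 30, 20.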

Definition at line 27 of file split_data.hpp.

Referenced by Split().

◆ StratifiedSplit()

void mlpack::data::StratifiedSplit ( const arma::Mat< T > &  input,
const LabelsType &  inputLabel,
arma::Mat< T > &  trainData,
arma::Mat< T > &  testData,
LabelsType &  trainLabel,
LabelsType &  testLabel,
const double  testRatio,
const bool  shuffleData = true 
)

Given an input dataset and labels, stratify into a training set and test set.

It is recommended that the input labels lie in the range [0, n), where n is the number of different labels. The NormalizeLabels() function in mlpack::data can be used for this. Labels are expected to be of type arma::Row<> or arma::Col<>; a runtime error is thrown if this is not the case.

Example usage below. This overload places the stratified dataset into the four output parameters given (trainData, testData, trainLabel, and testLabel).

arma::mat input = loadData();
arma::Row<size_t> label = loadLabel();
arma::mat trainData;
arma::mat testData;
arma::Row<size_t> trainLabel;
arma::Row<size_t> testLabel;
math::RandomSeed(100); // Set the seed if you like.
// Stratify the dataset into a training and test set, with 30% of the data
// being held out for the test set.
StratifiedSplit(input, label, trainData,
testData, trainLabel, testLabel, 0.3);
Parameters
input  Input dataset to stratify.
inputLabel  Input labels to stratify.
trainData  Matrix to store training data into.
testData  Matrix to store test data into.
trainLabel  Vector to store training labels into.
testLabel  Vector to store test labels into.
testRatio  Fraction of dataset to use for test set (between 0 and 1).
shuffleData  If true, the sample order is shuffled; otherwise, each sample is visited in linear order (default true).

Basic idea: Let us say we have to stratify a dataset based on labels:

0 0 0 0 0 (5 0s)
1 1 1 1 1 1 1 1 1 1 1 (11 1s)

Let our test ratio be 0.2. Then, the number of 0 labels in our test set = floor(5 * 0.2) = 1. The number of 1 labels in our test set = floor(11 * 0.2) = 2.

In our first pass over the dataset, we visit each label and keep count of each label in our 'labelCounts' uvec.

We then take a second pass over the dataset. We now maintain an additional uvec 'testLabelCounts' to hold the label counts of our test set.

In this pass, when we encounter a label we check the 'testLabelCounts' uvec for the count of this label in the test set. If this count is less than the required number of labels in the test set, we add the data to the test set and increment the label count in the uvec. If this count is equal to or more than the required count in the test set, we add this data to the train set.

Based on the above steps, we get the following labels in the split set:

Train set (4 0s, 9 1s): 0 0 0 0 1 1 1 1 1 1 1 1 1

Test set (1 0, 2 1s): 0 1 1
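
The two passes described above can be written out as a standalone sketch that returns the indices assigned to each set. `StratifiedSplitSketch()` is a hypothetical helper that follows the description only; it omits shuffling, works on std::vector rather than Armadillo types, and assumes labels already lie in [0, n).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the two-pass stratified split described above.
// labels must lie in [0, n). Indices of samples are appended to trainIndices
// or testIndices in visiting order.
inline void StratifiedSplitSketch(const std::vector<std::size_t>& labels,
                                  const double testRatio,
                                  std::vector<std::size_t>& trainIndices,
                                  std::vector<std::size_t>& testIndices)
{
  // First pass: count how often each label occurs.
  std::size_t numClasses = 0;
  for (const std::size_t label : labels)
    numClasses = std::max(numClasses, label + 1);
  std::vector<std::size_t> labelCounts(numClasses, 0);
  for (const std::size_t label : labels)
    ++labelCounts[label];

  // Required number of test samples per label: floor(count * testRatio).
  std::vector<std::size_t> required(numClasses, 0);
  for (std::size_t c = 0; c < numClasses; ++c)
  {
    required[c] = static_cast<std::size_t>(
        std::floor(static_cast<double>(labelCounts[c]) * testRatio));
  }

  // Second pass: fill the test set until each label's quota is met, then
  // send the remaining samples of that label to the training set.
  std::vector<std::size_t> testLabelCounts(numClasses, 0);
  trainIndices.clear();
  testIndices.clear();
  for (std::size_t i = 0; i < labels.size(); ++i)
  {
    const std::size_t label = labels[i];
    if (testLabelCounts[label] < required[label])
    {
      testIndices.push_back(i);
      ++testLabelCounts[label];
    }
    else
    {
      trainIndices.push_back(i);
    }
  }
}
```

For the example above (five 0s followed by eleven 1s, test ratio 0.2), the test set receives one 0 and two 1s and the training set receives the remaining four 0s and nine 1s.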

Definition at line 103 of file split_data.hpp.

References mlpack::util::CheckSameSizes().

Referenced by Split().