Definition of the TfIdfEncodingPolicy class.

Public Types

enum  TfTypes { BINARY, RAW_COUNT, TERM_FREQUENCY, SUBLINEAR_TF }
 Enum class used to identify the type of the term frequency statistics.

 

Public Member Functions

 TfIdfEncodingPolicy (const TfTypes tfType=TfTypes::RAW_COUNT, const bool smoothIdf=true)
 Construct this using the term frequency type and the inverse document frequency type.

 
template<typename MatType>
void Encode (MatType &output, const size_t value, const size_t line, const size_t)
 The function performs the tf-idf encoding, i.e. it writes the encoded token to the output.

 
template<typename ElemType>
void Encode (std::vector< std::vector< ElemType >> &output, const size_t value, const size_t line, const size_t)
 The function performs the tf-idf encoding, i.e. it writes the encoded token to the output.

 
const std::vector< size_t > & LinesSizes () const
 Return the line sizes.

 
std::vector< size_t > & LinesSizes ()
 Modify the line sizes.

 
const std::unordered_map< size_t, size_t > & NumContainingStrings () const
 Get the number of strings that contain each token.

 
std::unordered_map< size_t, size_t > & NumContainingStrings ()
 Modify the number of strings that contain each token.

 
void PreprocessToken (const size_t line, const size_t, const size_t value)
 
void Reset ()
 Clear the necessary internal variables.

 
template<typename Archive>
void serialize (Archive &ar, const uint32_t)
 Serialize the class to the given archive.

 
bool SmoothIdf () const
 Determine the idf algorithm type (whether it's smooth or not).

 
bool & SmoothIdf ()
 Modify the idf algorithm type (whether it's smooth or not).

 
TfTypes TfType () const
 Return the term frequency type.

 
TfTypes & TfType ()
 Modify the term frequency type.

 
const std::vector< std::unordered_map< size_t, size_t > > & TokensFrequences () const
 Return token frequencies.

 
std::vector< std::unordered_map< size_t, size_t > > & TokensFrequences ()
 Modify token frequencies.

 

Static Public Member Functions

template<typename MatType>
static void InitMatrix (MatType &output, const size_t datasetSize, const size_t, const size_t dictionarySize)
 The function initializes the output matrix.

 
template<typename ElemType>
static void InitMatrix (std::vector< std::vector< ElemType >> &output, const size_t datasetSize, const size_t, const size_t dictionarySize)
 The function initializes the output matrix.

 

Detailed Description

Definition of the TfIdfEncodingPolicy class.

TfIdfEncodingPolicy is used as a helper class for StringEncoding.

Tf-idf is a weighting scheme that takes into account the importance of encoded tokens. The tf-idf statistic is equal to the term frequency (tf) multiplied by the inverse document frequency (idf). The encoder assigns the corresponding tf-idf value to each token. The order in which the tokens are labeled is defined by the dictionary used by the StringEncoding class. The encoder writes data in either column-major or row-major order, depending on the output data type.

Definition at line 35 of file tf_idf_encoding_policy.hpp.
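
The two-step computation described above (collect statistics, then write tf multiplied by idf) can be sketched without mlpack. This is a minimal illustration, not the library's implementation: TfIdfSketch is a hypothetical helper, it fixes the tf type to RAW_COUNT and uses smooth idf (the constructor defaults), and it assumes that tokens are dictionary labels starting at 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of mlpack): encodes a tokenized corpus with
// raw-count tf and smooth idf. Tokens are assumed to be labels in
// [1, dictSize]. The result has one row per string (row-major layout).
std::vector<std::vector<double>> TfIdfSketch(
    const std::vector<std::vector<std::size_t>>& corpus,
    const std::size_t dictSize)
{
  const std::size_t n = corpus.size();

  // Pass 1: per-string token counts and per-token document frequencies.
  std::vector<std::unordered_map<std::size_t, std::size_t>> counts(n);
  std::unordered_map<std::size_t, std::size_t> df;
  for (std::size_t line = 0; line < n; ++line)
  {
    for (const std::size_t token : corpus[line])
    {
      assert(token >= 1 && token <= dictSize);
      // The first occurrence of a token in this string increments df.
      if (counts[line][token]++ == 0)
        ++df[token];
    }
  }

  // Pass 2: write tf * idf for every token that occurs in each string.
  std::vector<std::vector<double>> output(
      n, std::vector<double>(dictSize, 0.0));
  for (std::size_t line = 0; line < n; ++line)
  {
    for (const auto& entry : counts[line])
    {
      const double tf = static_cast<double>(entry.second);
      const double idf =
          std::log((1.0 + n) / (1.0 + df[entry.first])) + 1.0;
      output[line][entry.first - 1] = tf * idf;
    }
  }
  return output;
}
```

Tokens that do not occur in a string keep the value 0, so the result is typically sparse for large dictionaries.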

Member Enumeration Documentation

◆ TfTypes

enum TfTypes
strong

Enum class used to identify the type of the term frequency statistics.

The present implementation supports the following types:

BINARY  Term frequency equals 1 if the row contains the encoded token and 0 otherwise.
RAW_COUNT  Term frequency equals the number of times the encoded token occurs in the row.
TERM_FREQUENCY  Term frequency equals the number of times the encoded token occurs in the row divided by the total number of tokens in the row.
SUBLINEAR_TF  Term frequency equals $ 1 + \log(rawCount), $ where rawCount is the number of times the encoded token occurs in the row.

Enumerator
BINARY 
RAW_COUNT 
TERM_FREQUENCY 
SUBLINEAR_TF 

Definition at line 53 of file tf_idf_encoding_policy.hpp.
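
The four term frequency types can be compared with a small standalone function. This is only a sketch of the formulas listed above: the enum is a local stand-in for TfIdfEncodingPolicy::TfTypes, TermFrequency is a hypothetical helper, and returning 0 for a token that does not occur in the row is an assumption made here so that the logarithm is never taken of zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Local stand-in for TfIdfEncodingPolicy::TfTypes (same enumerator names).
enum class TfTypes { BINARY, RAW_COUNT, TERM_FREQUENCY, SUBLINEAR_TF };

// Hypothetical helper: term frequency of a token that occurs rawCount times
// in a row containing rowSize tokens in total.
double TermFrequency(const TfTypes type,
                     const std::size_t rawCount,
                     const std::size_t rowSize)
{
  assert(rawCount <= rowSize);
  // Assumption: an absent token has term frequency 0 for every type.
  if (rawCount == 0)
    return 0.0;

  switch (type)
  {
    case TfTypes::BINARY:
      return 1.0;
    case TfTypes::RAW_COUNT:
      return static_cast<double>(rawCount);
    case TfTypes::TERM_FREQUENCY:
      return static_cast<double>(rawCount) / static_cast<double>(rowSize);
    case TfTypes::SUBLINEAR_TF:
      return 1.0 + std::log(static_cast<double>(rawCount));
  }
  return 0.0; // Not reached for valid enumerators.
}
```

For a token that occurs 3 times in a row of 4 tokens, the types give 1, 3, 0.75 and 1 + log(3) respectively, which shows how SUBLINEAR_TF dampens repeated occurrences compared to RAW_COUNT.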

Constructor & Destructor Documentation

◆ TfIdfEncodingPolicy()

TfIdfEncodingPolicy ( const TfTypes  tfType = TfTypes::RAW_COUNT,
const bool  smoothIdf = true 
)
inline

Construct this using the term frequency type and the inverse document frequency type.

Parameters
tfType  Type of the term frequency statistics.
smoothIdf  Indicates whether to use smooth idf. If idf is smooth, it is calculated by the following formula: $ idf(T) = \log \frac{1 + N}{1 + df(T)} + 1, $ where $ N $ is the total number of strings in the document, $ T $ is the current encoded token, and $ df(T) $ is the number of strings that contain the token. If idf is not smooth, the following rule applies: $ idf(T) = \log \frac{N}{df(T)} + 1. $

Definition at line 75 of file tf_idf_encoding_policy.hpp.

Referenced by TfIdfEncodingPolicy::serialize().
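
The two idf formulas from the smoothIdf parameter description can be written as one function. This is a sketch of the formulas only: InverseDocumentFrequency and its parameter names are hypothetical, and the natural logarithm is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: idf of a token contained in numContaining (df(T)) of
// numStrings (N) strings, computed with or without smoothing.
double InverseDocumentFrequency(const std::size_t numStrings,
                                const std::size_t numContaining,
                                const bool smoothIdf)
{
  // A token in the dictionary occurs in at least one string.
  assert(numContaining > 0 && numContaining <= numStrings);
  const double n = static_cast<double>(numStrings);
  const double df = static_cast<double>(numContaining);

  return smoothIdf ? std::log((1.0 + n) / (1.0 + df)) + 1.0
                   : std::log(n / df) + 1.0;
}
```

Both variants return 1 for a token that occurs in every string, so such tokens are not zeroed out. Smoothing reduces the weight of rare tokens slightly: with N = 4 and df(T) = 1, the smooth value is log(5/2) + 1, while the non-smooth value is log(4) + 1.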

Member Function Documentation

◆ Encode() [1/2]

void Encode ( MatType &  output,
const size_t  value,
const size_t  line,
const size_t   
)
inline

The function performs the tf-idf encoding, i.e. it writes the encoded token to the output. The encoder writes data in column-major order.

Template Parameters
MatType  The output matrix type.
Parameters
output  Output matrix to store the encoded results (sp_mat or mat).
value  The encoded token.
line  The line number at which the encoding is performed.
(index)  The token index in the line.

Definition at line 148 of file tf_idf_encoding_policy.hpp.

◆ Encode() [2/2]

void Encode ( std::vector< std::vector< ElemType >> &  output,
const size_t  value,
const size_t  line,
const size_t   
)
inline

The function performs the tf-idf encoding, i.e. it writes the encoded token to the output. The encoder writes data in row-major order.

Overloaded function to accept vector<vector<ElemType>> as the output type.

Template Parameters
ElemType  Type of the output values.
Parameters
output  Output matrix to store the encoded results.
value  The encoded token.
line  The line number at which the encoding is performed.
(index)  The token index in the line.

Definition at line 180 of file tf_idf_encoding_policy.hpp.
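
The difference between the two Encode() overloads is where a value lands in memory. The sketch below illustrates the two layouts with standard containers only. ColumnMajorSketch, WriteColumnMajor and WriteRowMajor are hypothetical names, the flat buffer merely stands in for an Armadillo matrix, and the mapping of token label value to index value - 1 is an assumption (labels starting at 1).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a column-major matrix with one column per input string and
// one row per dictionary entry.
struct ColumnMajorSketch
{
  std::size_t rows, cols;
  std::vector<double> data;

  ColumnMajorSketch(const std::size_t r, const std::size_t c) :
      rows(r), cols(c), data(r * c, 0.0) { }

  // Elements of one column are contiguous in memory.
  double& operator()(const std::size_t r, const std::size_t c)
  {
    assert(r < rows && c < cols);
    return data[c * rows + r];
  }
};

// Column-major output: the string index selects the column.
void WriteColumnMajor(ColumnMajorSketch& output, const std::size_t value,
                      const std::size_t line, const double tfIdf)
{
  assert(value >= 1);
  output(value - 1, line) = tfIdf;
}

// Row-major output: the string index selects the row (the inner vector).
void WriteRowMajor(std::vector<std::vector<double>>& output,
                   const std::size_t value, const std::size_t line,
                   const double tfIdf)
{
  assert(line < output.size());
  assert(value >= 1 && value <= output[line].size());
  output[line][value - 1] = tfIdf;
}
```

In both layouts all values that belong to one input string are stored contiguously, as a column in the first case and as an inner vector in the second.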

◆ InitMatrix() [1/2]

static void InitMatrix ( MatType &  output,
const size_t  datasetSize,
const size_t  ,
const size_t  dictionarySize 
)
inline static

The function initializes the output matrix.

The encoder writes data in column-major order.

Template Parameters
MatType  The output matrix type.
Parameters
output  Output matrix to store the encoded results (sp_mat or mat).
datasetSize  The number of strings in the input dataset.
(maxNumTokens)  The maximum number of tokens in the strings of the input dataset (not used).
dictionarySize  The size of the dictionary.

Definition at line 104 of file tf_idf_encoding_policy.hpp.

◆ InitMatrix() [2/2]

static void InitMatrix ( std::vector< std::vector< ElemType >> &  output,
const size_t  datasetSize,
const size_t  ,
const size_t  dictionarySize 
)
inline static

The function initializes the output matrix.

The encoder writes data in the row-major order.

Overloaded function to save the result in vector<vector<ElemType>>.

Template Parameters
ElemType  Type of the output values.
Parameters
output  Output matrix to store the encoded results.
datasetSize  The number of strings in the input dataset.
(maxNumTokens)  The maximum number of tokens in the strings of the input dataset (not used).
dictionarySize  The size of the dictionary.

Definition at line 127 of file tf_idf_encoding_policy.hpp.

◆ LinesSizes() [1/2]

const std::vector<size_t>& LinesSizes ( ) const
inline

Return the line sizes.

Definition at line 242 of file tf_idf_encoding_policy.hpp.

◆ LinesSizes() [2/2]

std::vector<size_t>& LinesSizes ( )
inline

Modify the line sizes.

Definition at line 244 of file tf_idf_encoding_policy.hpp.

◆ NumContainingStrings() [1/2]

const std::unordered_map<size_t, size_t>& NumContainingStrings ( ) const
inline

Get the number of strings that contain each token.

Definition at line 230 of file tf_idf_encoding_policy.hpp.

◆ NumContainingStrings() [2/2]

std::unordered_map<size_t, size_t>& NumContainingStrings ( )
inline

Modify the number of strings that contain each token.

Definition at line 236 of file tf_idf_encoding_policy.hpp.

◆ PreprocessToken()

void PreprocessToken ( const size_t  line,
const size_t  ,
const size_t  value 
)
inline

Definition at line 202 of file tf_idf_encoding_policy.hpp.

◆ Reset()

void Reset ( )
inline

Clear the necessary internal variables.

Definition at line 84 of file tf_idf_encoding_policy.hpp.

◆ serialize()

void serialize ( Archive &  ar,
const uint32_t   
)
inline

◆ SmoothIdf() [1/2]

bool SmoothIdf ( ) const
inline

Determine the idf algorithm type (whether it's smooth or not).

Definition at line 252 of file tf_idf_encoding_policy.hpp.

◆ SmoothIdf() [2/2]

bool& SmoothIdf ( )
inline

Modify the idf algorithm type (whether it's smooth or not).

Definition at line 254 of file tf_idf_encoding_policy.hpp.

◆ TfType() [1/2]

TfTypes TfType ( ) const
inline

Return the term frequency type.

Definition at line 247 of file tf_idf_encoding_policy.hpp.

◆ TfType() [2/2]

TfTypes& TfType ( )
inline

Modify the term frequency type.

Definition at line 249 of file tf_idf_encoding_policy.hpp.

◆ TokensFrequences() [1/2]

const std::vector<std::unordered_map<size_t, size_t> >& TokensFrequences ( ) const
inline

Return token frequencies.

Definition at line 222 of file tf_idf_encoding_policy.hpp.

◆ TokensFrequences() [2/2]

std::vector<std::unordered_map<size_t, size_t> >& TokensFrequences ( )
inline

Modify token frequencies.

Definition at line 224 of file tf_idf_encoding_policy.hpp.

