Definition of the TfIdfEncodingPolicy class.
Public Types | |
enum | TfTypes { BINARY , RAW_COUNT , TERM_FREQUENCY , SUBLINEAR_TF } |
Enum class used to identify the type of the term frequency statistics. | |
Public Member Functions | |
TfIdfEncodingPolicy (const TfTypes tfType=TfTypes::RAW_COUNT, const bool smoothIdf=true) | |
Construct this using the term frequency type and the inverse document frequency type. | |
template < typename MatType > | |
void | Encode (MatType &output, const size_t value, const size_t line, const size_t) |
The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output. | |
template < typename ElemType > | |
void | Encode (std::vector< std::vector< ElemType >> &output, const size_t value, const size_t line, const size_t) |
The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output. | |
const std::vector< size_t > & | LinesSizes () const |
Return the line sizes. | |
std::vector< size_t > & | LinesSizes () |
Modify the line sizes. | |
const std::unordered_map< size_t, size_t > & | NumContainingStrings () const |
Get the number of strings that contain each token. | |
std::unordered_map< size_t, size_t > & | NumContainingStrings () |
Modify the number of strings that contain each token. | |
void | PreprocessToken (const size_t line, const size_t, const size_t value) |
void | Reset () |
Clear the necessary internal variables. | |
template < typename Archive > | |
void | serialize (Archive &ar, const uint32_t) |
Serialize the class to the given archive. | |
bool | SmoothIdf () const |
Determine the idf algorithm type (whether it's smooth or not). | |
bool & | SmoothIdf () |
Modify the idf algorithm type (whether it's smooth or not). | |
TfTypes | TfType () const |
Return the term frequency type. | |
TfTypes & | TfType () |
Modify the term frequency type. | |
const std::vector< std::unordered_map< size_t, size_t > > & | TokensFrequences () const |
Return token frequencies. | |
std::vector< std::unordered_map< size_t, size_t > > & | TokensFrequences () |
Modify token frequencies. | |
Static Public Member Functions | |
template < typename MatType > | |
static void | InitMatrix (MatType &output, const size_t datasetSize, const size_t, const size_t dictionarySize) |
The function initializes the output matrix. | |
template < typename ElemType > | |
static void | InitMatrix (std::vector< std::vector< ElemType >> &output, const size_t datasetSize, const size_t, const size_t dictionarySize) |
The function initializes the output matrix. | |
TfIdfEncodingPolicy is used as a helper class for StringEncoding.
Tf-idf is a weighting scheme that takes into account the importance of encoded tokens. The tf-idf statistic equals the term frequency (tf) multiplied by the inverse document frequency (idf). The encoder assigns the corresponding tf-idf value to each token. The order in which the tokens are labeled is defined by the dictionary used by the StringEncoding class. The encoder writes data in either column-major or row-major order, depending on the output data type.
Definition at line 35 of file tf_idf_encoding_policy.hpp.
enum | TfTypes | strong |
Enum class used to identify the type of the term frequency statistics.
The present implementation supports the following types:
BINARY | Term frequency equals 1 if the row contains the encoded token and 0 otherwise. |
RAW_COUNT | Term frequency equals the number of times the encoded token occurs in the row. |
TERM_FREQUENCY | Term frequency equals the number of times the encoded token occurs in the row divided by the total number of tokens in the row. |
SUBLINEAR_TF | Term frequency equals $1 + \log(\mathit{rawCount})$, where rawCount is the number of times the encoded token occurs in the row. |
Enumerator | |
---|---|
BINARY | |
RAW_COUNT | |
TERM_FREQUENCY | |
SUBLINEAR_TF |
Definition at line 53 of file tf_idf_encoding_policy.hpp.
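The four term frequency types can be illustrated with a small standalone function. This is a minimal sketch rather than the class's actual code: the free function `TermFrequency` and the local `TfTypes` enum are hypothetical stand-ins, and the SUBLINEAR_TF branch assumes the common `1 + log(rawCount)` form with the natural logarithm.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for TfIdfEncodingPolicy::TfTypes.
enum class TfTypes { BINARY, RAW_COUNT, TERM_FREQUENCY, SUBLINEAR_TF };

// rawCount: number of times the token occurs in the row.
// rowSize:  total number of tokens in the row (must be > 0 for TERM_FREQUENCY).
double TermFrequency(const TfTypes type,
                     const std::size_t rawCount,
                     const std::size_t rowSize)
{
  switch (type)
  {
    case TfTypes::BINARY:
      return rawCount > 0 ? 1.0 : 0.0;
    case TfTypes::RAW_COUNT:
      return static_cast<double>(rawCount);
    case TfTypes::TERM_FREQUENCY:
      return static_cast<double>(rawCount) / static_cast<double>(rowSize);
    case TfTypes::SUBLINEAR_TF:
      // Assumed formula: 1 + ln(rawCount); only meaningful for rawCount > 0.
      return 1.0 + std::log(static_cast<double>(rawCount));
  }
  return 0.0;  // Unreachable for valid enum values.
}
```

For a token that occurs twice in a row of eight tokens, this sketch gives 1 for BINARY, 2 for RAW_COUNT and 0.25 for TERM_FREQUENCY.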
TfIdfEncodingPolicy (const TfTypes tfType=TfTypes::RAW_COUNT, const bool smoothIdf=true) | inline |
Construct this using the term frequency type and the inverse document frequency type.
tfType | Type of the term frequency statistics. |
smoothIdf | Used to indicate whether to use smooth idf or not. If idf is smooth, it is calculated by the formula $idf(T) = \log \frac{1 + N}{1 + df(T)} + 1$, where $N$ is the total number of strings in the document, $T$ is the current encoded token, and $df(T)$ is the number of strings which contain the token. If idf isn't smooth, the following rule applies: $idf(T) = \log \frac{N}{df(T)} + 1$. |
Definition at line 75 of file tf_idf_encoding_policy.hpp.
Referenced by TfIdfEncodingPolicy::serialize().
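The effect of `smoothIdf` can be sketched as a small standalone function. The function name and parameter names below are hypothetical, and the two formulas are assumptions: the widely used smooth form `log((1 + N) / (1 + df)) + 1` and the plain form `log(N / df) + 1`, both with the natural logarithm.

```cpp
#include <cmath>
#include <cstddef>

// numStrings:    N, the total number of strings in the input.
// numContaining: df(T), the number of strings that contain the token.
double InverseDocumentFrequency(const std::size_t numStrings,
                                const std::size_t numContaining,
                                const bool smoothIdf)
{
  const double n = static_cast<double>(numStrings);
  const double df = static_cast<double>(numContaining);

  if (smoothIdf)
  {
    // Adding 1 to both counts keeps the result finite even when df == 0.
    return std::log((1.0 + n) / (1.0 + df)) + 1.0;
  }

  // Without smoothing, df == 0 yields an infinite value.
  return std::log(n / df) + 1.0;
}
```

In both variants, a token that appears in every string gets an idf of exactly 1, so its tf-idf value equals its term frequency.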
void | Encode (MatType &output, const size_t value, const size_t line, const size_t) | inline |
The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output. The encoder writes data in the column-major order.
MatType | The output matrix type. |
output | Output matrix to store the encoded results (sp_mat or mat). |
value | The encoded token. |
line | The line number at which the encoding is performed. |
* | (index) The token index in the line. |
Definition at line 148 of file tf_idf_encoding_policy.hpp.
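void | Encode (std::vector< std::vector< ElemType >> &output, const size_t value, const size_t line, const size_t) | inline |
The function performs the TfIdf encoding algorithm, i.e. it writes the encoded token to the output. The encoder writes data in the row-major order.
Overloaded function to accept vector<vector<ElemType>> as the output type.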
ElemType | Type of the output values. |
output | Output matrix to store the encoded results. |
value | The encoded token. |
line | The line number at which the encoding is performed. |
* | (index) The token index in the line. |
Definition at line 180 of file tf_idf_encoding_policy.hpp.
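static void | InitMatrix (MatType &output, const size_t datasetSize, const size_t, const size_t dictionarySize) | inline static |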
The function initializes the output matrix.
The encoder writes data in the column-major order.
MatType | The output matrix type. |
output | Output matrix to store the encoded results (sp_mat or mat). |
datasetSize | The number of strings in the input dataset. |
* | (maxNumTokens) The maximum number of tokens in the strings of the input dataset (not used). |
dictionarySize | The size of the dictionary. |
Definition at line 104 of file tf_idf_encoding_policy.hpp.
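static void | InitMatrix (std::vector< std::vector< ElemType >> &output, const size_t datasetSize, const size_t, const size_t dictionarySize) | inline static |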
The function initializes the output matrix.
The encoder writes data in the row-major order.
Overloaded function to save the result in vector<vector<ElemType>>.
ElemType | Type of the output values. |
output | Output matrix to store the encoded results. |
datasetSize | The number of strings in the input dataset. |
* | (maxNumTokens) The maximum number of tokens in the strings of the input dataset (not used). |
dictionarySize | The size of the dictionary. |
Definition at line 127 of file tf_idf_encoding_policy.hpp.
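The difference between the two output layouts can be sketched without any matrix library. Everything below is illustrative: `ColMajorMatrix` is a hypothetical stand-in for a dense column-major matrix type, and the sketch assumes that the matrix output has `dictionarySize` rows and `datasetSize` columns (one column per input string), while the nested-vector output has one inner vector per input string.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense column-major matrix.
struct ColMajorMatrix
{
  std::size_t rows = 0;
  std::size_t cols = 0;
  std::vector<double> data;

  void Zeros(const std::size_t r, const std::size_t c)
  {
    rows = r;
    cols = c;
    data.assign(r * c, 0.0);
  }

  // Elements of one column are contiguous in memory.
  double& operator()(const std::size_t r, const std::size_t c)
  {
    return data[c * rows + r];
  }
};

// Column-major output: one column per string, one row per dictionary token.
void InitColumnMajor(ColMajorMatrix& output,
                     const std::size_t datasetSize,
                     const std::size_t dictionarySize)
{
  output.Zeros(dictionarySize, datasetSize);
}

// Row-major output: one inner vector per string, one element per token.
void InitRowMajor(std::vector<std::vector<double>>& output,
                  const std::size_t datasetSize,
                  const std::size_t dictionarySize)
{
  output.assign(datasetSize, std::vector<double>(dictionarySize, 0.0));
}
```

Under these assumptions, the value for token `t` in string `s` lives at `output(t, s)` in the matrix case and at `output[s][t]` in the nested-vector case. Neither layout depends on the maximum number of tokens per string, which is consistent with that parameter being unused.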
const std::vector< size_t > & | LinesSizes () const | inline |
Return the line sizes (the number of tokens in each line).
Definition at line 242 of file tf_idf_encoding_policy.hpp.
std::vector< size_t > & | LinesSizes () | inline |
Modify the line sizes (the number of tokens in each line).
Definition at line 244 of file tf_idf_encoding_policy.hpp.
const std::unordered_map< size_t, size_t > & | NumContainingStrings () const | inline |
Get the map from each token to the number of strings that contain it.
Definition at line 230 of file tf_idf_encoding_policy.hpp.
std::unordered_map< size_t, size_t > & | NumContainingStrings () | inline |
Modify the map from each token to the number of strings that contain it.
Definition at line 236 of file tf_idf_encoding_policy.hpp.
void | PreprocessToken (const size_t line, const size_t, const size_t value) | inline |
Definition at line 202 of file tf_idf_encoding_policy.hpp.
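PreprocessToken() has no description here, but its signature and the accessors LinesSizes(), NumContainingStrings() and TokensFrequences() suggest that it accumulates per-token statistics before encoding. The following is a sketch of how such a first pass could work, not the class's actual code: the struct `TfIdfStatistics` is hypothetical, its member names merely mirror the accessors, and the update logic is an assumption.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical container for the statistics a tf-idf encoder needs.
struct TfIdfStatistics
{
  // tokensFrequences[line][token]: occurrences of the token in that line.
  std::vector<std::unordered_map<std::size_t, std::size_t>> tokensFrequences;
  // numContainingStrings[token]: number of lines that contain the token.
  std::unordered_map<std::size_t, std::size_t> numContainingStrings;
  // linesSizes[line]: total number of tokens in that line.
  std::vector<std::size_t> linesSizes;

  // Record one token; the token index within the line is not needed.
  void PreprocessToken(const std::size_t line,
                       const std::size_t /* index */,
                       const std::size_t value)
  {
    if (line >= tokensFrequences.size())
    {
      tokensFrequences.resize(line + 1);
      linesSizes.resize(line + 1, 0);
    }

    // The first occurrence in this line means one more line contains the token.
    if (tokensFrequences[line][value]++ == 0)
      ++numContainingStrings[value];

    ++linesSizes[line];
  }

  // Clear all accumulated statistics.
  void Reset()
  {
    tokensFrequences.clear();
    numContainingStrings.clear();
    linesSizes.clear();
  }
};
```

With statistics of this shape, the term frequency of a token in a line and the number of lines containing it are both available by lookup when the tf-idf value is written to the output.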
void | Reset () | inline |
Clear the necessary internal variables.
Definition at line 84 of file tf_idf_encoding_policy.hpp.
void | serialize (Archive &ar, const uint32_t) | inline |
Serialize the class to the given archive.
Definition at line 260 of file tf_idf_encoding_policy.hpp.
References TfIdfEncodingPolicy::BINARY, Log::Fatal, TfIdfEncodingPolicy::RAW_COUNT, TfIdfEncodingPolicy::SUBLINEAR_TF, TfIdfEncodingPolicy::TERM_FREQUENCY, and TfIdfEncodingPolicy::TfIdfEncodingPolicy().
bool | SmoothIdf () const | inline |
Determine the idf algorithm type (whether it's smooth or not).
Definition at line 252 of file tf_idf_encoding_policy.hpp.
bool & | SmoothIdf () | inline |
Modify the idf algorithm type (whether it's smooth or not).
Definition at line 254 of file tf_idf_encoding_policy.hpp.
TfTypes | TfType () const | inline |
Return the term frequency type.
Definition at line 247 of file tf_idf_encoding_policy.hpp.
TfTypes & | TfType () | inline |
Modify the term frequency type.
Definition at line 249 of file tf_idf_encoding_policy.hpp.
const std::vector< std::unordered_map< size_t, size_t > > & | TokensFrequences () const | inline |
Return token frequencies.
Definition at line 222 of file tf_idf_encoding_policy.hpp.
std::vector< std::unordered_map< size_t, size_t > > & | TokensFrequences () | inline |
Modify token frequencies.
Definition at line 224 of file tf_idf_encoding_policy.hpp.