Functionality of the fastText R package

Lampros Mouselimis

2021-05-25

This vignette explains the functionality of the fastText R package. This R package is an interface to the fasttext library for efficient learning of word representations and sentence classification. The following functions are included,

fastText
fasttext_interface Interface for the fasttext library
plot_progress_logs Plot the progress of loss, learning-rate and word-counts
printAnalogiesUsage Print Usage Information when the command equals to ‘analogies’
printDumpUsage Print Usage Information when the command equals to ‘dump’
printNNUsage Print Usage Information when the command equals to ‘nn’
printPredictUsage Print Usage Information when the command equals to ‘predict’ or ‘predict-prob’
printPrintNgramsUsage Print Usage Information when the command equals to ‘print-ngrams’
printPrintSentenceVectorsUsage Print Usage Information when the command equals to ‘print-sentence-vectors’
printPrintWordVectorsUsage Print Usage Information when the command equals to ‘print-word-vectors’
printQuantizeUsage Print Usage Information when the command equals to ‘quantize’
printTestLabelUsage Print Usage Information when the command equals to ‘test-label’
printTestUsage Print Usage Information when the command equals to ‘test’
printUsage Print Usage Information for all parameters
print_parameters Print the parameters for a specific command


I’ll explain each function separately based on example data. More information can be found in the package documentation.



fasttext_interface


This function allows the user to run the various methods included in the fasttext library from within R. The data that I’ll use in the following code snippets can be downloaded as a .zip file (named as fastText_data) from my Github repository. The user should then unzip the file and make the extracted folder his / hers default directory (using the base R function setwd()) before running the following code chunks.



setwd('fastText_data')                  # make the extracted data the default directory

#------
# cbow
#------

library(fastText)

list_params = list(command = 'cbow', 
                   lr = 0.1, 
                   dim = 50,
                   input = "example_text.txt",
                   output = file.path(tempdir(), 'word_vectors'), 
                   verbose = 2, 
                   thread = 1)

res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'cbow_logs.txt'),
                         MilliSecs = 5,
                         remove_previous_file = TRUE,
                         print_process_time = TRUE)

Read 0M words
Number of words:  8
Number of labels: 0
Progress: 100.0% words/sec/thread:    2933 lr:  0.000000 loss:  4.060542 ETA:   0h 0m

time to complete : 3.542332 secs 


The data is saved in the specified tempdir() folder for illustration purposes. The user is advised to specify his / her own folder.



#-----------
# supervised
#-----------

list_params = list(command = 'supervised', 
                   lr = 0.1,
                   dim = 50,
                   input = file.path("cooking.stackexchange", "cooking.train"),
                   output = file.path(tempdir(), 'model_cooking'), 
                   verbose = 2, 
                   thread = 4)

res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'sup_logs.txt'),
                         MilliSecs = 5,
                         remove_previous_file = TRUE,
                         print_process_time = TRUE)


Read 0M words
Number of words:  14543
Number of labels: 735
Progress: 100.0% words/sec/thread:   63282 lr:  0.000000 loss: 10.049338 ETA:   0h 0m

time to complete : 3.449003 secs 


The user has here also the option to plot the progress of loss, learning-rate and word-counts,


res = plot_progress_logs(path = file.path(tempdir(), 'sup_logs.txt'), 
                         plot = TRUE)

dim(res)



The verbosity for the logs-file depends on,

parameters.


The next command can be utilized to ‘predict’ new data based on the output model,



#-------------------
# 'predict' function
#-------------------

list_params = list(command = 'predict',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   test_data = file.path('cooking.stackexchange', 'cooking.valid'), 
                   k = 1,
                   th = 0.0)

res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'predict_valid.txt'))


These output predictions will be of the following form ’__label__food-safety’ , where each line will represent a new label (number of lines of the input data must match the number of lines of the output data). With the ‘predict-prob’ command someone can obtain the probabilities of the labels as well,



#------------------------
# 'predict-prob' function
#------------------------

list_params = list(command = 'predict-prob',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   test_data = file.path('cooking.stackexchange', 'cooking.valid'), 
                   k = 1,
                   th = 0.0)

res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'predict_valid_prob.txt'))


Using ‘predict-prob’ the output predictions will be of the following form ’__label__baking 0.0282927’


Once the model was trained, someone can evaluate it by computing the precision and recall at ‘k’ on a test set. The ‘test’ command just prints the metrics in the R session,



#----------------
# 'test' function
#----------------


list_params = list(command = 'test',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   test_data = file.path('cooking.stackexchange', 'cooking.valid'),
                   k = 1,
                   th = 0.0)

res = fasttext_interface(list_params)

N   3000
P@1 0.138
R@1 0.060


whereas the ‘test-label’ command allows the user to save,



#----------------------
# 'test-label' function
#----------------------

list_params = list(command = 'test-label',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   test_data = file.path('cooking.stackexchange', 'cooking.valid'),
                   k = 1,
                   th = 0.0)

res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'test_label_valid.txt'))


the output to the ‘test_label_valid.txt’ file, which includes the ‘Precision’ and ‘Recall’ for each unique label on the data set (‘cooking.stackexchange.txt’). That means the number of rows of the ‘test_label_valid.txt’ must be equal to the unique labels in the ‘cooking.stackexchange.txt’ data set. This can be verified using the following code snippet,



st_dat = read.delim(file.path("cooking.stackexchange", "cooking.stackexchange.txt"), 
                    stringsAsFactors = FALSE)

res_stackexch = unlist(lapply(1:nrow(st_dat), function(y)
  
  strsplit(st_dat[y, ], " ")[[1]][which(sapply(strsplit(st_dat[y, ], " ")[[1]], function(x)
    
    substr(x, 1, 9) == "__label__") == T)])
)

test_label_valid = read.table(file.path(tempdir(), 'test_label_valid.txt'), 
                              quote="\"", comment.char="")

# number of unique labels of data equal to the rows of the 'test_label_valid.txt' file
length(unique(res_stackexch)) == nrow(test_label_valid)             
[1] TRUE

head(test_label_valid)

        V1 V2       V3        V4 V5       V6     V7 V8       V9                    V10
1 F1-Score  : 0.234244 Precision  : 0.139535 Recall  : 0.729167        __label__baking
2 F1-Score  : 0.227746 Precision  : 0.132571 Recall  : 0.807377   __label__food-safety
3 F1-Score  : 0.058824 Precision  : 0.750000 Recall  : 0.030612 __label__substitutions
4 F1-Score  : 0.000000 Precision  : -------- Recall  : 0.000000     __label__equipment
5 F1-Score  : 0.017699 Precision  : 1.000000 Recall  : 0.008929         __label__bread
6 F1-Score  : 0.000000 Precision  : -------- Recall  : 0.000000       __label__chicken
.....


The user can also ‘quantize’ a supervised model to reduce its memory usage with the following command,



#---------------------
# 'quantize' function
#---------------------


list_params = list(command = 'quantize',
                   input = file.path(tempdir(), 'model_cooking.bin'),
                   output = file.path(tempdir(), 'model_cooking')) 

res = fasttext_interface(list_params)


print(list.files(tempdir(), pattern = '.ftz'))
[1] "model_cooking.ftz"


The quantize function is currenlty (as of 01/02/2019) single-threaded.


Based on the ‘queries.txt’ text file the user can save the word vectors to a file using the following command ( one vector per line ),



#----------------------------
# print-word-vectors function
#----------------------------

list_params = list(command = 'print-word-vectors',
                   model = file.path(tempdir(), 'model_cooking.bin'))

res = fasttext_interface(list_params,
                         path_input = 'queries.txt',
                         path_output = file.path(tempdir(), 'word_vecs_queries.txt'))


To compute vector representations of sentences or paragraphs use the following command,



#--------------------------------
# print-sentence-vectors function
#--------------------------------

library(fastText)

list_params = list(command = 'print-sentence-vectors',
                   model = file.path(tempdir(), 'model_cooking.bin'))

res = fasttext_interface(list_params,
                         path_input = 'text_sentence.txt',
                         path_output = file.path(tempdir(), 'word_sentence_queries.txt'))


Be aware that for the ‘print-sentence-vectors’ the ‘word_sentence_queries.txt’ file should consist of sentences of the following form,



How much does potato starch affect a cheese sauce recipe</s>
Dangerous pathogens capable of growing in acidic environments</s>
How do I cover up the white spots on my cast iron stove</s>
How do I cover up the white spots on my cast iron stove</s>
Michelin Three Star Restaurant but if the chef is not there</s>
......


Therefore each line should end in EOS (end of sentence ) and that if at the end of the file a newline exists then the function will return an additional word vector. Thus the user should make sure that the input file does not include empty lines at the end of the file.


The ‘print-ngrams’ command prints the n-grams of a word in the R session or saves the n-grams to a file. But first the user should save the model and word-vectors with n-grams enabled (minn, maxn parameters)



#----------------------------------------
# 'skipgram' function with n-gram enabled
#----------------------------------------

list_params = list(command = 'skipgram', 
                   lr = 0.1,
                   dim = 50,
                   input = "example_text.txt",
                   output = file.path(tempdir(), 'word_vectors'),
                   verbose = 2, 
                   thread = 1,
                   minn = 2, 
                   maxn = 2)

res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'skipgram_logs.txt'),
                         MilliSecs = 5)


#-----------------------
# 'print-ngram' function
#-----------------------

list_params = list(command = 'print-ngrams',
                   model = file.path(tempdir(), 'word_vectors.bin'),
                   word = 'word')

# save output to file
res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'ngrams.txt'))
                         
#  print output to console
res = fasttext_interface(list_params, 
                         path_output = "")      


# truncated output for the 'word' query
#--------------------------------------

<w -0.00983 0.00178 0.01929 0.00994 ......
wo 0.01114 0.01570 0.01294 0.01284 ......
or -0.01035 -0.01585 -0.01671 -0.00043 ......
rd 0.01840 -0.00378 -0.01468 0.01088 ......
d> 0.00749 0.00720 0.01171 0.01258  ......


The ‘nn’ command returns the nearest neighbors for a specific word based on the input model,



#--------------
# 'nn' function
#--------------

list_params = list(command = 'nn',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   k = 5,
                   query_word = 'sauce')

res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'nearest.txt'))

# 'nearest.txt'
#--------------

rice 0.804595
</s> 0.799858
Vide 0.78893
store 0.788918
cheese 0.785977


The ‘analogies’ command works for triplets of words (separated by whitespace) and returns ‘k’ rows for each line (triplet) of the input file (separated by an empty line),



#---------------------
# 'analogies' function
#---------------------

list_params = list(command = 'analogies',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   k = 5)

res = fasttext_interface(list_params, 
                         path_input = 'analogy_queries.txt', 
                         path_output = file.path(tempdir(), 'analogies_output.txt'))


# 'analogies_output.txt'
#-----------------------

batter 0.857213
I 0.854491
recipe? 0.851498
substituted 0.845269
flour 0.842508

covered 0.808651
calls 0.801348
fresh 0.800051
cold 0.797468
always 0.793695

.............


Finally, the ‘dump’ function takes as ‘option’ one of the ‘args’, ‘dict’, ‘input’ or ‘output’ and dumps the output to a text file,



#--------------
# dump function
#--------------

list_params = list(command = 'dump',
                   model = file.path(tempdir(), 'model_cooking.bin'),
                   option = 'args')

res = fasttext_interface(list_params, 
                         path_output = file.path(tempdir(), 'dump_data.txt'), 
                         remove_previous_file = TRUE)
                         

# 'dump_data.txt'
#----------------

dim 50
ws 5
epoch 5
minCount 1
neg 5
wordNgrams 1
loss softmax
model sup
bucket 0
minn 0
maxn 0
lrUpdateRate 100
t 0.00010