---
title: "Using the koRpus Package for Text Analysis"
author: "m.eik michalke"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    includes:
      in_header: vignette_header.html
bibliography: koRpus_lit.bib
csl: apa.csl
abstract: >
  The R package `koRpus` aims to be a versatile tool for text analysis, with an emphasis on scientific research on that topic.
  It implements dozens of formulae to measure readability and lexical diversity. On a more basic level `koRpus` can be used
  as an R wrapper for third party products, like the tokenizer and POS tagger TreeTagger or language corpora of the Leipzig Corpora Collection.
  This vignette takes a brief tour around its core components, shows how they can be used and gives some insight on design decisions.
vignette: >
  %\VignetteIndexEntry{Using the koRpus Package for Text Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
header_con <- file("vignette_header.html")
writeLines('', header_con)
close(header_con)
```
```{r, include=FALSE, cache=FALSE}
library(koRpus)
# manually add tag definition, the koRpus.lang.en package might be missing
koRpus::set.lang.support("kRp.POS.tags",
## tag and class definitions
# en -- english
list("en"=list(
tag.class.def.words=matrix(c(
"CC", "conjunction", "Coordinating conjunction",
"CD", "number", "Cardinal number",
"DT", "determiner", "Determiner",
"EX", "existential", "Existential there",
"FW", "foreign", "Foreign word",
"IN", "preposition", "Preposition or subordinating conjunction",
"IN/that", "preposition", "Preposition or subordinating conjunction",
"JJ", "adjective", "Adjective",
"JJR", "adjective", "Adjective, comparative",
"JJS", "adjective", "Adjective, superlative",
"LS", "listmarker", "List item marker",
"MD", "modal", "Modal",
"NN", "noun", "Noun, singular or mass",
"NNS", "noun", "Noun, plural",
"NP", "name", "Proper noun, singular",
"NPS", "name", "Proper noun, plural",
"NS", "noun", "Noun, plural", # undocumented, bug in parameter file?
"PDT", "predeterminer", "Predeterminer",
"POS", "possesive", "Possessive ending",
"PP", "pronoun", "Personal pronoun",
"PP$", "pronoun", "Possessive pronoun",
"RB", "adverb", "Adverb",
"RBR", "adverb", "Adverb, comparative",
"RBS", "adverb", "Adverb, superlative",
"RP", "particle", " Particle",
"SYM", "symbol", "Symbol",
"TO", "to", "to",
"UH", "interjection", "Interjection",
"VB", "verb", "Verb, base form of \"to be\"",
"VBD", "verb", "Verb, past tense of \"to be\"",
"VBG", "verb", "Verb, gerund or present participle of \"to be\"",
"VBN", "verb", "Verb, past participle of \"to be\"",
"VBP", "verb", "Verb, non-3rd person singular present of \"to be\"",
"VBZ", "verb", "Verb, 3rd person singular present of \"to be\"",
"VH", "verb", "Verb, base form of \"to have\"",
"VHD", "verb", "Verb, past tense of \"to have\"",
"VHG", "verb", "Verb, gerund or present participle of \"to have\"",
"VHN", "verb", "Verb, past participle of \"to have\"",
"VHP", "verb", "Verb, non-3rd person singular present of \"to have\"",
"VHZ", "verb", "Verb, 3rd person singular present of \"to have\"",
"VV", "verb", "Verb, base form",
"VVD", "verb", "Verb, past tense",
"VVG", "verb", "Verb, gerund or present participle",
"VVN", "verb", "Verb, past participle",
"VVP", "verb", "Verb, non-3rd person singular present",
"VVZ", "verb", "Verb, 3rd person singular present",
"WDT", "determiner", "Wh-determiner",
"WP", "pronoun", "Wh-pronoun",
"WP$", "pronoun", "Possessive wh-pronoun",
"WRB", "adverb", "Wh-adverb"
), ncol=3, byrow=TRUE, dimnames=list(c(),c("tag","wclass","desc"))),
tag.class.def.punct=matrix(c(
",", "comma", "Comma", # not in guidelines
"(", "punctuation", "Opening bracket", # not in guidelines
")", "punctuation", "Closing bracket", # not in guidelines
":", "punctuation", "Punctuation", # not in guidelines
"``", "punctuation", "Quote", # not in guidelines
"''", "punctuation", "End quote", # not in guidelines
"#", "punctuation", "Punctuation", # not in guidelines
"$", "punctuation", "Punctuation" # not in guidelines
), ncol=3, byrow=TRUE, dimnames=list(c(),c("tag","wclass","desc"))),
tag.class.def.sentc=matrix(c(
"SENT", "fullstop", "Sentence ending punctuation" # not in guidelines
), ncol=3, byrow=TRUE, dimnames=list(c(),c("tag","wclass","desc")))
)
)
)
# we'll also fool hyphen() into believing "en" is an available language,
# while actually using a previously hyphenated object
fake.hyph.en <- new(
"kRp.hyph.pat",
lang="en",
pattern=matrix(
c(".im5b", ".imb", "0050"),
ncol=3,
dimnames=list(c(), c("orig", "char", "nums"))
)
)
set.hyph.support(list("en"=fake.hyph.en))
```
```{r, set-options, echo=FALSE, cache=FALSE}
options(width=85)
```
# What is koRpus?
Work on `koRpus` started in February 2011, primarily with the goal of examining how similar different texts are. Since then,
it has quickly grown into an R package which implements dozens of formulae for readability and lexical diversity, as well as wrappers for language corpus databases and
a tokenizer/POS tagger.
# Recommendations
## TreeTagger
At the very beginning of almost every analysis with this package, the text you want to examine has to be sliced into its components, and these components
must be identified and named. That is, it has to be split into its semantic parts (tokens): words, numbers, punctuation marks. After that, each token
is tagged with its part-of-speech (POS). For both of these steps, `koRpus` can use the third party software [TreeTagger](https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) [@schmid_TT_1994].
Especially for Windows users, installation of TreeTagger might be a little more complex -- e.g., it depends on Perl^[For a free implementation try ], and you need a tool to extract .tar.gz archives.^[Like ] Detailed installation instructions are beyond the scope of this vignette.
If you don't want to use TreeTagger, `koRpus` provides a simple tokenizer of its own called `tokenize()`. While the tokenizing itself works quite well, `tokenize()` is not as elaborate as TreeTagger when it comes to POS tagging, as it can merely distinguish words from numbers, punctuation and abbreviations. Although this is sufficient for most readability formulae, you can't evaluate word classes in detail. If that's what you want, a TreeTagger installation is needed.
## Word lists
Some of the readability formulae depend on special word lists [like @bormuth_cloze_1968; @dale_formula_1948; @spache_new_1953]. For copyright reasons these lists are not included as of now. This means that as long as you don't have copies of these lists, you can't calculate these particular measures, but of course all the others. The expected format for using such a list with this package is a simple text file with one word per line, preferably in UTF-8 encoding.
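As a minimal sketch (not evaluated here) of how such a list would be passed on: assuming you have a suitable word list saved under a made-up path and a tagged text object like the one created in the sample session below, the `word.lists` argument of `readability()` takes a named list of file paths.
```{r, eval=FALSE}
# sketch only: the file path is a placeholder, and "tagged.text" refers to
# the tagged text object created later in this vignette
readability(
  tagged.text,
  index="Spache",
  word.lists=list(Spache="~/wordlists/spache_easy_words.txt")
)
```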
## Language corpora
The frequency analysis functions in this package can look up how often each word in a text is used in its language, given that a corpus database is provided. Databases
in Celex format are supported, as is the Leipzig Corpora Collection [@quasthoff_LCC_2006] file format. To use such a database with this package, you simply need to download one of the .zip/.tar files.
## Translated Human Rights Declaration
If you want to estimate the language of a text, reference texts in known languages are needed. In `koRpus`, the [Universal Declaration of Human Rights with its more than 350 translations](https://www.unicode.org/udhr/downloads.html) is used.
# A sample session
From now on it is assumed that the above requirements are correctly installed and working. If an optional component is used it will be noted. Further, we'll need a sample text to analyze.
We'll use the section on [defense mechanisms of Phasmatodea](https://en.wikipedia.org/wiki/Phasmatodea#Defense_mechanisms) from Wikipedia for this purpose.
## Loading a language package
In order to do some analysis, you need to load a language support package for each language you would like to work with. For instance, in this vignette we're analyzing an English sample text. Language support packages for `koRpus` are named `koRpus.lang.**`, where `**` is a two-character ID for the respective language, like `en` for English.^[Unfortunately, these language packages did not get the approval of the CRAN maintainers and are officially hosted at [https://undocumeantit.github.io/repos/l10n/](https://undocumeantit.github.io/repos/l10n/). For your convenience the function `install.koRpus.lang()` can be used to easily install them anyway.]
```{r, eval=FALSE}
# install the language support package
install.koRpus.lang("en")
# load the package
library(koRpus.lang.en)
```
When `koRpus` itself is loaded, it will list all language support packages found on your system. To get a list of all installable packages, call `available.koRpus.lang()`.
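For example, to check which language support packages are currently available for installation:
```{r, eval=FALSE}
# list all koRpus language support packages that can be installed
# from the l10n repository
available.koRpus.lang()
```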
## Tokenizing and POS tagging
As explained earlier, splitting the text up into its basic components can be done by TreeTagger. To achieve this and have the results available in R, the function `treetag()` is used.
### `treetag()`
At the very least you must provide it with the text, of course, and name the language it is written in. In addition to that you must specify where you installed TreeTagger. If you look at the package documentation you'll see that `treetag()` understands a number of options to configure TreeTagger, but in most cases using one of the built-in presets should suffice. TreeTagger comes with batch/shell scripts for installed languages, and the presets of `treetag()` are basically just R implementations of these scripts.
```{r, eval=FALSE}
tagged.text <- treetag(
"sample_text.txt",
treetagger="manual",
lang="en",
TT.options=list(
path="~/bin/treetagger/",
preset="en"
),
doc_id="sample"
)
```
```{r, include=FALSE, cache=FALSE}
tagged.text <- dget("sample_text_treetagged_dput.txt")
```
The first argument (file name) and `lang` should explain themselves. The `treetagger` option can either take the full path to one of the original TreeTagger scripts mentioned above, or the keyword "manual", which causes `treetag()` to use the configuration defined in `TT.options`. To use a preset, just provide the `path` to your local TreeTagger installation and a valid `preset` name here.^[Presets are defined in the language support packages, usually named like their respective two-character language identifier. Refer to their documentation.] The document ID is optional and can be omitted.
The resulting S4 object is of a class called `kRp.text`. If you call the object directly you get a shortened view of its main content:
```{r}
tagged.text
```
Once you've come this far, i.e., having a valid object of class `kRp.text`, all following analyses should run smoothly.
#### Troubleshooting
If `treetag()` fails, first re-run it with the extra option `debug=TRUE`. Most interestingly, that will print the contents of `sys.tt.call`, which is the TreeTagger command handed to your operating system for execution. With that it should be possible to pinpoint where exactly the erroneous behavior starts.
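As a sketch, such a debugging re-run of the call from above (assuming the same local TreeTagger installation path) might look like this:
```{r, eval=FALSE}
tagged.text <- treetag(
  "sample_text.txt",
  treetagger="manual",
  lang="en",
  TT.options=list(
    path="~/bin/treetagger/",
    preset="en"
  ),
  doc_id="sample",
  debug=TRUE
)
```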
### Alternative: `tokenize()`
If you don't need detailed word class analysis, you should be fine using `koRpus`' own function `tokenize()`. As you can see, `tokenize()` arrives at the same tokens, but is rather limited when it comes to recognizing word classes:
```{r}
(tokenized.text <- tokenize(
"sample_text.txt",
lang="en",
doc_id="sample"
))
```
### Accessing data from `koRpus` objects
For this class of objects, `koRpus` provides some comfortable methods to extract the portions you're interested in. For example, the main results are to be found in the slot `tokens`. In addition to TreeTagger's original output (token, tag and lemma) `treetag()` also automatically counts letters and assigns tokens to global word classes. To get these results as a data.frame, use the getter method `taggedText()`:
```{r}
taggedText(tagged.text)[26:34,]
```
In case you want to access a subset of the data in the resulting object, e.g., only the column with the number of letters or the first five rows of `tokens`, you'll be happy to know there are special `[` and `[[` methods for these kinds of objects:
```{r}
head(tagged.text[["lttr"]], n=50)
```
```{r}
tagged.text[1:5,]
```
The `[` and `[[` methods are basically useful shortcut replacements for `taggedText()`.
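For instance, the following two calls should give you the same five rows (a quick sketch, not evaluated here):
```{r, eval=FALSE}
# the shortcut method and the explicit getter return the same data
tagged.text[1:5, ]
taggedText(tagged.text)[1:5, ]
```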
### Descriptive statistics
The objects returned by both `treetag()` and `tokenize()` also provide various descriptive statistics calculated from the analyzed text. You can get them by calling `describe()` on the object:
```{r, eval=FALSE}
describe(tagged.text)
```
```{r, echo=FALSE}
(txt_desc <- describe(tagged.text))
txt_desc_lttr <- txt_desc[["lttr.distrib"]]
```
Amongst others, you will find several indices describing the number of characters:
* `all.chars`: Counts each character, including all space characters
* `normalized.space`: Like `all.chars`, but clusters of space characters (incl. line breaks) are counted only as one character
* `chars.no.space`: Counts all characters except any space characters
* `letters.only`: Counts only letters, excluding(!) digits (which are counted separately as `digits`)
You'll also find the number of `words` and `sentences`, as well as average word and sentence lengths, and tables describing how word length is distributed throughout the text (`lttr.distrib`). For instance, we see that the text has `r txt_desc_lttr["num",3]` words with three letters, `r txt_desc_lttr["cum.sum",3]` with three or less, and `r txt_desc_lttr["cum.inv",3]` with more than three. The last three rows of that table show the corresponding percentages.
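These numbers come straight from the letter distribution table; a sketch of how to inspect it yourself (the same `lttr.distrib` element that was used for the inline values above):
```{r, eval=FALSE}
# the letter distribution table: columns are word lengths in letters, rows
# hold counts, cumulative counts, and the corresponding percentages
describe(tagged.text)[["lttr.distrib"]]
```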
## Lexical diversity (type token ratios)
To analyze the lexical diversity of our text we can now simply hand over the tagged text object to the `lex.div()` method. You can call it on the object with no further arguments (like `lex.div(tagged.text)`), but in this example we'll limit the analysis to a few measures:^[For information on the measures shown see @tweedie_how_1998, @mccarthy_vocd_2007, @mccarthy_mtld_2010.]
```{r, eval=FALSE}
lex.div(
tagged.text,
measure=c("TTR", "MSTTR", "MATTR","HD-D", "MTLD", "MTLD-MA"),
char=c("TTR", "MATTR","HD-D", "MTLD", "MTLD-MA")
)
```
```{r, echo=FALSE}
lex.div(
tagged.text,
measure=c("TTR", "MSTTR", "MATTR","HD-D", "MTLD", "MTLD-MA"),
char=c("TTR", "MATTR","HD-D", "MTLD", "MTLD-MA"),
quiet=TRUE
)
```
Let's look at some particular parts: At first we are informed of the total number of types, tokens and lemmas (if available). After that the actual results are printed, using the package's `show()` method for this particular kind of object. As you can see, it prints the actual value of each measure before a summary of its characteristics.^[Characteristics can be looked at to examine each measure's dependency on text length. They are calculated by computing each measure repeatedly, beginning with only the first token, then adding the next, and so on, until the full text has been analyzed.]
Some measures return more information than just their actual index value. For instance, when the Mean Segmental Type-Token Ratio is calculated, you'll be informed how much of your text was dropped and hence not examined. A small helper tool of `koRpus`, `segment.optimizer()`, automatically recommends a different segment size if that would decrease the number of lost tokens.
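If you would like to try that yourself, here's a minimal sketch; it assumes that `segment.optimizer()` takes the total number of tokens as its first argument (see its documentation for the full set of parameters):
```{r, eval=FALSE}
# sketch: look for a segment size near the default of 100 that would
# drop as few tokens as possible; the token count is taken from the
# descriptive statistics of our tagged text
segment.optimizer(describe(tagged.text)[["words"]], segment=100)
```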
By default, `lex.div()` calculates every measure of lexical diversity that is implemented. Of course this is fully configurable, e.g., to completely skip the calculation of characteristics just add the option `char=NULL`. If you're only interested in one particular measure, it might be more convenient to call the corresponding wrapper function instead of `lex.div()`. For example, to calculate only the measures proposed by @maas_ueber_1972:
```{r}
maas(tagged.text)
```
All wrapper functions have characteristics turned off by default. The following example demonstrates how to calculate and plot the classic type-token ratio with characteristics. The resulting plot shows the typical degradation of TTR values with increasing text length:
```{r, eval=FALSE}
ttr.res <- TTR(tagged.text, char=TRUE)
plot(ttr.res@TTR.char, type="l", main="TTR degradation over text length")
```
```{r, echo=FALSE}
ttr.res <- TTR(tagged.text, char=TRUE, quiet=TRUE)
plot(ttr.res@TTR.char, type="l", main="TTR degradation over text length")
```
Since this package is intended for research, it is possible to directly influence all relevant values of each measure and examine the effects. For example, as mentioned before, `segment.optimizer()` recommended a change of segment size for MSTTR to drop fewer words, which is easily done:
```{r}
MSTTR(tagged.text, segment=92)
```
Please refer to the documentation for more detailed information on the available measures and their references.
## Frequency analysis
### Importing language corpora data
This package has rudimentary support for importing corpus databases.^[The package also has a function called `read.corp.custom()` which can be used to process language corpora yourself, and store the results in an object of class `kRp.corp.freq`, which is the class returned by `read.corp.LCC()` and `read.corp.celex()` as well. That is, if you can't get an already analyzed corpus database but have a huge language corpus at hand, you can create your own frequency database. But be warned that depending on corpus size and your hardware, this might take ages. On the other hand, `read.corp.custom()` will provide inverse document frequency (idf) values for all types, which is necessary to compute tf-idf with `freq.analysis()`.] That is, it can read frequency data for words into an R object and use this object for further analysis. Besides the [Celex](http://celex.mpi.nl) database format (`read.corp.celex()`), it can read the LCC flat file format^[Actually, it understands two different LCC formats, both the older .zip and the newer .tar archive format.] (`read.corp.LCC()`). The latter might be of special interest, because the needed database archives can be [freely downloaded](https://wortschatz.uni-leipzig.de/en/download/). Once you've downloaded one of these archives, it can be comfortably imported:
```{r, eval=FALSE}
LCC.en <- read.corp.LCC("~/downloads/corpora/eng_news_2010_1M-text.tar")
```
`read.corp.LCC()` will automatically extract the files it needs from the archive. Alternatively, you can specify the path to the unpacked archive as well. To work with the imported data directly, the tool `query()` was added to the package. It helps you to comfortably look up certain words, or ranges of interesting values:
```{r, eval=FALSE}
query(LCC.en, "word", "what")
```
```
## num word freq pct pmio log10 rank.avg rank.min rank.rel.avg
## 160 210 what 16396 0.000780145 780 2.892095 260759 260759 99.95362
## rank.rel.min
## 160 99.95362
```
```{r, eval=FALSE}
query(LCC.en, "pmio", c(780, 790))
```
```
## num word freq pct pmio log10 rank.avg rank.min rank.rel.avg
## 156 206 many 16588 0.0007892806 789 2.897077 260763 260763 99.95515
## 157 207 per 16492 0.0007847128 784 2.894316 260762 260762 99.95477
## 158 208 down 16468 0.0007835708 783 2.893762 260761 260761 99.95439
## 159 209 since 16431 0.0007818103 781 2.892651 260760 260760 99.95400
## 160 210 what 16396 0.0007801450 780 2.892095 260759 260759 99.95362
## rank.rel.min
## 156 99.95515
## 157 99.95477
## 158 99.95439
## 159 99.95400
## 160 99.95362
```
### Conduct a frequency analysis
We can now conduct a full frequency analysis of our text:
```{r, eval=FALSE}
freq.analysis.res <- freq.analysis(tagged.text, corp.freq=LCC.en)
```
The resulting object holds a lot of information, even if no corpus data was used (i.e., `corp.freq=NULL`). To begin with, it contains the two slots `tokens` and `lang`, which are copied from the analyzed tagged text object. In this way analysis results can always be converted back into `kRp.text` objects.^[This can easily be done by calling `as(freq.analysis.res, "kRp.text")`.] However, since corpus data was provided, the tagging results have gained three new columns:
```{r, eval=FALSE}
taggedText(freq.analysis.res)
```
```
## token tag lemma lttr [...] pmio rank.avg rank.min
[...]
## 30 an DT an 2 3817 99.98735 99.98735
## 31 attack NN attack 6 163 99.70370 99.70370
## 32 has VBZ have 3 4318 99.98888 99.98888
## 33 been VBN be 4 2488 99.98313 99.98313
## 34 initiated VBN initiate 9 11 97.32617 97.32137
## 35 ( ( ( 1 854 99.96013 99.96013
## 36 secondary JJ secondary 9 21 98.23846 98.23674
## 37 defense NN defense 7 210 99.77499 99.77499
## 38 ) ) ) 1 856 99.96052 99.96052
[...]
```
Perhaps most informative is the column `pmio`, which shows how often the respective token appears per one million tokens, according to the corpus data. In addition, the previously introduced slot `desc` now contains some more descriptive statistics on our text, and since we provided a corpus database, the slot `freq.analysis` lists summaries of various frequency information that was calculated.
If the corpus object also provides inverse document frequency data (i.e., values in a column called `idf`), `freq.analysis()` will automatically compute tf-idf statistics and put them in a column called `tfidf`.
### New to the `desc` slot
Amongst others, the descriptives now also give easy access to character vectors with all words (`$all.words`) and all lemmata (`$all.lemmata`), all tokens sorted^[This sorting depends on proper POS-tagging, so this will only contain useful data if you used `treetag()` instead of `tokenize()`.] into word classes (e.g., all verbs in `$classes$verb`), or the number of words in each sentence:
```{r, eval=FALSE}
describe(freq.analysis.res)[["sentc.length"]]
```
```
## [1] 34 10 37 16 44 31 14 31 34 23 17 43 40 47 22 19 65 29
```
As a practical example, the list `$classes` has proven to be very helpful for debugging the results of TreeTagger, which is remarkably accurate, but of course not free from making a mistake now and then. By looking through `$classes`, where all tokens are grouped according to the global word class TreeTagger attributed to them, at least obvious errors (like names mistakenly taken for a pronoun) are easily found:^[And can then be corrected by using the function `correct.tag()`, as sketched after the output below.]
```{r, eval=FALSE}
describe(freq.analysis.res)$classes
```
```
## $conjunction
## [1] "both" "and" "and" "and" "and" "or" "or" "and" "and" "or"
## [11] "and" "or" "and" "or" "and" "and" "and" "and"
##
## $number
## [1] "20" "one"
##
## $determiner
## [1] "an" "the" "an" "The" "the" "the" "some"
## [8] "that" "Some" "the" "a" "a" "a" "the"
## [15] "that" "the" "the" "Another" "which" "the" "a"
## [22] "that" "a" "The" "a" "the" "that" "a"
[...]
```
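Should you spot such a mistake, a correction with `correct.tag()` might look like the following sketch; the row number and replacement tag used here are purely hypothetical:
```{r, eval=FALSE}
# suppose the token in row 123 was mistakenly tagged as a personal
# pronoun (PP) and is actually a proper noun (NP) -- both the row and
# the tag are made-up values for illustration
tagged.text <- correct.tag(tagged.text, row=123, tag="NP")
```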
## Readability
The package comes with implementations of several readability formulae. Some of them depend on the number of syllables in the text.^[Whether this is the case can be looked up in the documentation.] To achieve this, the method `hyphen()` takes objects of class `kRp.text` and applies a hyphenation algorithm [@liang_word_1983] to each word. This algorithm was originally developed for automatic word hyphenation in $\LaTeX$, and is gracefully misused here to fulfill a slightly different service.^[The `hyphen()` method was originally implemented as part of the `koRpus` package, but was later split off into its own package called `sylly`.]
```{r, eval=FALSE}
(hyph.txt.en <- hyphen(tagged.text))
```
```{r, include=FALSE, cache=FALSE}
hyph.txt.en <- dget("sample_text_hyphenated_dput.txt")
```
```{r}
hyph.txt.en
```
This separate hyphenation step can actually be skipped, as `readability()` will do it automatically if needed. But similar to TreeTagger, `hyphen()` will most likely not produce perfect results. As a rule of thumb, if in doubt it tends to behave rather conservatively, that is, it underestimates the real number of syllables in a text. This, of course, affects the results of several readability formulae.
So, the more accurate the end results need to be, the less you should rely on the automatic hyphenation alone. But it sure is a good starting point, for there is a method called `correct.hyph()` to help you clean these results of errors later on. The most straightforward way to do this is to call `hyphenText(hyph.txt.en)`, which will get you a data frame with two columns, `word` (the hyphenated words) and `syll` (the number of syllables), which you can then edit in a spreadsheet editor:^[For example, this can be comfortably done with RKWard: ]
```{r}
head(hyphenText(hyph.txt.en))
```
You can then manually correct wrong hyphenations by removing or inserting "-" as hyphenation indicators, and call `correct.hyph()` without further arguments, which will cause it to recount all syllables:
```{r, eval=FALSE}
hyph.txt.en <- correct.hyph(hyph.txt.en)
```
But the method can also be used to alter entries directly, which might be simpler and cleaner than manual changes:
```{r, eval=FALSE}
hyph.txt.en <- correct.hyph(hyph.txt.en, word="mech-a-nisms", hyphen="mech-a-ni-sms")
```
```
## Changed
##
## syll word
## 2 3 mech-a-nisms
## 6 3 mech-a-nisms
##
## into
##
## syll word
## 2 4 mech-a-ni-sms
## 6 4 mech-a-ni-sms
```
The hyphenated text object can now be given to `readability()`, to calculate the measures of interest:^[Please note that as of version 0.04-18, the correctness of some of these calculations has not been extensively validated yet. The package was released nonetheless, also to find outstanding bugs in the implemented measures. Any information on the validity of its results is very welcome!]
```{r, eval=FALSE}
readbl.txt <- readability(tagged.text, hyphen=hyph.txt.en)
```
```{r, echo=FALSE}
suppressWarnings(readbl.txt <- readability(tagged.text, hyphen=hyph.txt.en))
```
Similar to `lex.div()`, by default `readability()` calculates almost^[Measures which rely on word lists will be skipped if no list is provided.] all available measures:
```{r}
readbl.txt
```
To get a more condensed overview of the results try the `summary()` method:
```{r}
summary(readbl.txt)
```
The `summary()` method supports an additional flat format, which basically turns the table into a named numeric vector,
using the raw values (because all indices have raw values, but only a few have more than that). This format comes in very handy when you
want to use the output in further calculations:
```{r}
summary(readbl.txt, flat=TRUE)
```
If you're interested in a particular formula, again a wrapper function might be more convenient:
```{r}
flesch.res <- flesch(tagged.text, hyphen=hyph.txt.en)
lix.res <- LIX(tagged.text) # LIX doesn't need syllable count
lix.res
```
### Readability from numeric data
It is possible to calculate the readability measures from the relevant key values directly, rather than analyzing an actual text, by using `readability.num()` instead of `readability()`. If you need to re-analyze a particular text, this can be considerably faster. Conveniently, all objects returned by `readability()` can be fed directly to `readability.num()`, since all relevant data is present in the `desc` slot.
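A brief sketch of that shortcut (not evaluated here), re-computing a single measure from the readability object we already have:
```{r, eval=FALSE}
# re-calculate just LIX from the values stored in the "desc" slot of the
# earlier readability results, without analyzing the text again
readability.num(readbl.txt, index="LIX")
```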
## Language detection
Another feature of this package is the detection of the language a text was (most probably) written in. This is done by gzipping reference texts in known languages, gzipping them again with a small sample of the text of unknown language appended, and determining for which reference text the added sample causes the smallest increase in compressed size [as described in @benedetto_gzip_2002]. By default, the compressed objects will be created in memory only.
To use the function `guess.lang()`, you first need to download the reference material. In this implementation, the Universal Declaration of Human Rights in Unicode formatting is used, because the document holds the world record for being the text translated into the most languages, and it is [publicly available](https://www.unicode.org/udhr/downloads.html). Please get the zipped archive with all translations in .txt format. You can, but don't have to, unzip the archive. The text whose language is to be guessed must also be a Unicode .txt file:
```{r, eval=FALSE}
guessed <- guess.lang(
file.path(find.package("koRpus"),"tests","testthat","sample_text.txt"),
udhr.path="~/downloads/udhr_txt.zip"
)
summary(guessed)
```
```
## Estimated language: English
## Identifier: eng
## Region: Europe
##
## 435 different languages were checked.
##
## Distribution of compression differences:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 136.0 168.0 176.0 190.7 184.0 280.0
##
## SD: 38.21
##
## Top 5 guesses:
## name iso639-3 bcp47 region diff diff.std
## 1 English eng en Europe 136 -1.430827
## 2 Scots sco sco Europe 136 -1.430827
## 3 Pidgin, Nigerian pcm pcm Africa 144 -1.221473
## 4 Catalan-Valencian-Balear cat ca Europe 152 -1.012119
## 5 French fra fr Europe 152 -1.012119
##
## Last 5 guesses:
## name iso639-3 bcp47 region diff diff.std
## 431 Burmese mya my Asia 280 2.337547
## 432 Shan shn shn Asia 280 2.337547
## 433 Tamil tam ta Asia 280 2.337547
## 434 Vietnamese (Han nom) vie vi-Hani Asia 280 2.337547
## 435 Chinese, Yue yue yue Asia 280 2.337547
```
# Extending `koRpus`
The language support of this package has a modular design. There are some pre-built language packages in [the `l10n` repository](https://undocumeantit.github.io/repos/), and with a little effort you should be able to add new languages yourself. You need the package sources for this; then, basically, you add a new file to them and rebuild/reinstall the package. More details on this topic can be found in `inst/README.languages`. Once you've got a new language working with `koRpus`, I'd be happy to include your module in the official distribution.
# Analyzing full corpora
Despite its name, the scope of `koRpus` is single texts. If you would like to analyze a full corpus of texts, have a look at the [plugin package `tm.plugin.koRpus`](https://github.com/unDocUMeantIt/tm.plugin.koRpus).
# Acknowledgements
The APA style used in this vignette was kindly provided by the [CSL project](https://citationstyles.org), licensed under [Creative Commons Attribution-ShareAlike 3.0 Unported license](https://creativecommons.org/licenses/by-sa/3.0/).
# References