
© [Christoph Carl Kling](https://commons.wikimedia.org/wiki/File:Topic_detection_in_a_document-word_matrix.gif) / [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en)
- transform texts so that the originals cannot be reconstructed
- document-term matrices
- n-grams

first twenty rows

| ngram_type | count | ngram |
| ---------: | -----: | :---- |
| 1 | 195891 | . |
| 1 | 145089 | the |
| 1 | 142295 | , |
| 1 | 134689 | to |
| 1 | 111336 | a |
| 1 | 106765 | I |
| 1 | 98259 | … |
| 1 | 85359 | and |
| 1 | 77553 | you |
| 1 | 73279 | of |
| 1 | 69851 | ! |
| 1 | 68355 | is |
| 1 | 68253 | in |
| 1 | 67921 | for |
| 1 | 58991 | : |
| 1 | 47744 | on |
| 1 | 44847 | ? |
| 1 | 40520 | my |
| 1 | 39672 | it |
| 1 | 37573 | that |

rows around 'library'

| ngram_type | count | ngram |
| ---------: | ----: | :--------- |
| 1 | 150 | POTUS |
| 1 | 150 | Seeing |
| 1 | 150 | TX |
| 1 | 150 | Unknown |
| 1 | 150 | apartment |
| 1 | 150 | audience |
| 1 | 150 | concern |
| 1 | 150 | friendship |
| 1 | 150 | hehe |
| 1 | 150 | hottest |
| 1 | 150 | **library** |
| 1 | 150 | limit |
| 1 | 150 | messages |
| 1 | 150 | pleased |
| 1 | 150 | print |
| 1 | 150 | properly |
| 1 | 150 | racism |
| 1 | 150 | 💞 |
| 1 | 149 | #free |
| 1 | 149 | Casino |

3-grams with 'library'

| ngram_type | count | ngram |
| ---------: | ----: | :--------- |
| 3 | 21 | library ! #tvtime |
| 3 | 21 | my library ! |
| 3 | 21 | to my library |
| 3 | 19 | in the library |
| 3 | 10 | at the library |
| 3 | 7 | to the library |
| 3 | 5 | the library . |
| 3 | 5 | the library and |
| 3 | 4 | library . I |
| 3 | 3 | from the library |
| 3 | 3 | the library but |
| 3 | 3 | visit the library |
| 3 | 2 | a library book |
| 3 | 2 | library is closed |
| 3 | 2 | the library , |
| 3 | 2 | the library i |
| 3 | 2 | the library in |
| 3 | 2 | the library on |
| 3 | 2 | the library this |
| 3 | 2 | the library tomorrow |
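
Counts like those in the tables above could be produced roughly as follows. This is a minimal sketch, not the pipeline behind the dataset: the tokenizer regex, the function names `tokenize` and `count_ngrams`, and the sample posts are all assumptions made for illustration. It keeps the original case and treats punctuation, hashtags and emojis as tokens.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the dataset's actual one):
# hashtags stay whole, words keep their case, and every other
# non-space character (punctuation, emoji) becomes its own token.
TOKEN_RE = re.compile(r"#\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split a post into tokens without lowercasing or dropping punctuation."""
    return TOKEN_RE.findall(text)


def count_ngrams(texts, n):
    """Count all n-grams (space-joined token windows) across the posts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up sample posts, for illustration only.
posts = [
    "Off to the library !",
    "I was in the library . I read",
    "Added to my library! #tvtime",
]

unigrams = count_ngrams(posts, 1)
trigrams = count_ngrams(posts, 3)

# Rows in the same layout as the tables: ngram_type, count, ngram.
for ngram, count in sorted(trigrams.items(), key=lambda kv: (-kv[1], kv[0])):
    print(3, count, ngram)
```

Only the aggregated counts would be kept, so word order beyond the n-gram window is lost and the original posts cannot be reassembled from the output.
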

Note:
- frequent English words
- punctuation
- mixed (original) case
- hashtags (#free) and emojis (💞)
--
## Statistics

Note:
- avg. 600k distinct 1-grams per month
- avg. 3.3M distinct 2-grams per month
- avg. 6M distinct 3-grams per month
- gaps: server fault / API v1 shutdown in 2022-12
- trends: decline until 2020 → resurgence during pandemic
--
## Published Dataset

doi:[10.5281/zenodo.15736201](https://doi.org/10.5281/zenodo.15736201)
--
## Limitations
- gaps & errors
- timeframe
- post-collection deletions
- sampling