history. International Journal of Humanities and
Arts Computing, 8(1):46–64.
Cieri, C., Fiumara, J., Strassel, S., Wright, J., DiPer-
sio, D., and Liberman, M. (2020). A progress re-
port on activities at the Linguistic Data Consortium
benefitting the LREC community. In Proceedings of
the 12th Language Resources and Evaluation Con-
ference, pages 3449–3456, Marseille, France, May.
European Language Resources Association.
Dodge, J., Sap, M., Marasovic, A., Agnew, W., Ilharco,
G., Groeneveld, D., Mitchell, M., and Gardner, M.
(2021). Documenting large webtext corpora: A case
study on the colossal clean crawled corpus. In Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 1286–
1305, Online and Punta Cana, Dominican Republic,
November. Association for Computational Linguis-
tics.
Eberhard, D. M., Simons, G. F., and Fennig,
C. D. (2021). Ethnologue: Languages of the world.
SIL International, 24th edition.
Gao, L., Biderman, S., Black, S., Golding, L.,
Hoppe, T., Foster, C., Phang, J., He, H., Thite, A.,
Nabeshima, N., et al. (2020). The Pile: An 800GB
dataset of diverse text for language modeling. arXiv
preprint arXiv:2101.00027.
Gao, L., Tow, J., Biderman, S., Black, S., DiPofi,
A., Foster, C., Golding, L., Hsu, J., McDonell, K.,
Muennighoff, N., Phang, J., Reynolds, L., Tang, E.,
Thite, A., Wang, B., Wang, K., and Zou, A. (2021).
A framework for few-shot language model evalua-
tion.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan,
J. W., Wallach, H. M., Daumé III, H., and Crawford,
K. (2018). Datasheets for datasets. CoRR,
abs/1803.09010v1.
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan,
J. W., Wallach, H., Daumé III, H., and Crawford, K.
(2021). Datasheets for datasets. Communications of
the ACM, 64(12):86–92.
Gehrmann, S., Adewumi, T., Aggarwal, K., Am-
manamanchi, P. S., Anuoluwapo, A., Bosselut, A.,
Chandu, K. R., Clinciu, M., Das, D., Dhole, K. D.,
et al. (2021). The GEM benchmark: Natural language
generation, its evaluation and metrics. arXiv
preprint arXiv:2102.01672.
Holland, S., Hosny, A., Newman, S., Joseph, J., and
Chmielinski, K. (2018). The dataset nutrition label:
A framework to drive higher data quality standards.
Jo, E. S. and Gebru, T. (2020). Lessons from archives:
Strategies for collecting sociocultural data in ma-
chine learning. In Proceedings of the 2020 Confer-
ence on Fairness, Accountability, and Transparency,
FAT* ’20, pages 306–316, New York, NY, USA. Association
for Computing Machinery.
Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van
Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani,
N., Sokolov, A., Sikasote, C., Setyawan, M., Sarin,
S., Samb, S., Sagot, B., Rivera, C., Rios, A., Pa-
padimitriou, I., Osei, S., Suárez, P. O., Orife, I.,
Ogueji, K., Rubungo, A. N., Nguyen, T. Q., Müller,
M., Müller, A., Muhammad, S. H., Muhammad,
N., Mnyakeni, A., Mirzakhalov, J., Matangira, T.,
Leong, C., Lawson, N., Kudugunta, S., Jernite, Y.,
Jenny, M., Firat, O., Dossou, B. F. P., Dlamini, S.,
de Silva, N., Çabuk Ballı, S., Biderman, S., Battisti,
A., Baruwa, A., Bapna, A., Baljekar, P., Azime,
I. A., Awokoya, A., Ataman, D., Ahia, O., Ahia,
O., Agrawal, S., and Adeyemi, M. (2021). Quality
at a glance: An audit of web-crawled multilingual
datasets. arXiv preprint arXiv:2103.12028.
Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur,
A., von Platen, P., Patil, S., Chaumond, J., Drame,
M., Plu, J., Tunstall, L., Davison, J., Šaško, M.,
Chhablani, G., Malik, B., Brandeis, S., Le Scao, T.,
Sanh, V., Xu, C., Patry, N., McMillan-Major, A.,
Schmid, P., Gugger, S., Delangue, C., Matussière,
T., Debut, L., Bekman, S., Cistac, P., Goehringer,
T., Mustar, V., Lagunas, F., Rush, A., and Wolf, T.
(2021). Datasets: A community library for natu-
ral language processing. In Proceedings of the 2021
Conference on Empirical Methods in Natural Lan-
guage Processing: System Demonstrations, pages
175–184, Online and Punta Cana, Dominican Re-
public, November. Association for Computational
Linguistics.
Luccioni, A. and Viviano, J. (2021). What’s in the
box? An analysis of undesirable content in the Common
Crawl corpus. In Proceedings of the 59th Annual
Meeting of the Association for Computational
Linguistics and the 11th International Joint Confer-
ence on Natural Language Processing (Volume 2:
Short Papers), pages 182–189, Online, August. As-
sociation for Computational Linguistics.
Obar, J. A. and Oeldorf-Hirsch, A. (2020). The biggest
lie on the internet: ignoring the privacy policies
and terms of service policies of social networking
services. Information, Communication & Society,
23(1):128–147.
Padilla, T., Allen, L., Frost, H., Potvin, S.,
Russey Roke, E., and Varner, S. (2019). Final Re-
port — Always Already Computational: Collections
as Data, May.
Paullada, A., Raji, I. D., Bender, E. M., Denton, E.,
and Hanna, A. (2021). Data and its (dis)contents:
A survey of dataset development and use in machine
learning research. Patterns, 2(11):100336.
Prabhu, V. U. and Birhane, A. (2020). Large image
datasets: A pyrrhic win for computer vision? arXiv
preprint arXiv:2006.16923.
Pushkarna, M., Zaldivar, A., Nanas, D., Brouillet,
E., Jana, R., Kjartansson, O., Smalls, D., and
Tsai, V. (2021). Data cards playbook.
https://pair-code.github.io/datacardsplaybook/, March.
Raddick, M. J., Bracey, G., Gay, P. L., Lintott, C. J.,
Cardamone, C., Murray, P., Schawinski, K., Sza-