Abstract

The Sketch Engine is a leading corpus tool, widely used in lexicography. Now ten years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, along with tools for users to build, upload and install their own corpora. The paper describes the core functions (word sketches, concordancing and the thesaurus). It outlines the different kinds of users and the approach taken to working with many different languages. It then reviews the kinds of corpora available in the Sketch Engine, gives a brief tour of some of the innovations from the last few years, and surveys other corpus tools and websites.

Information & Authors

Information

Published In

Lexicography
Volume 1, Number 1, 2014
Pages: 7–36

History

Published online: 4 November 2024

Authors

Affiliations

Adam Kilgarriff, Lexical Computing Ltd., GB
Vít Baisa, Lexical Computing Ltd., GB
Jan Bušta, Lexical Computing Ltd., GB
Miloš Jakubíček, Lexical Computing Ltd., GB
Vojtěch Kovář, Lexical Computing Ltd., GB
Jan Michelfeit, Lexical Computing Ltd., GB
Pavel Rychlý, Lexical Computing Ltd., GB
Vít Suchomel, Lexical Computing Ltd., GB

Citations

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel
Lexicography 2014 1:1, 7-36