Command-line utilities for managing and exploring annotated corpora

Joel Nothman, Tim Dawborn and James Curran

Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT 2014)
Dublin, Ireland, August 23-29, 2014


Users of annotated corpora frequently perform basic operations such as inspecting the available annotations, filtering documents, formatting data, and aggregating simple statistics over a corpus. While these operations may be easily performed over flat text files with stream-processing UNIX tools, comparable tools for structured annotation require custom design. Dawborn and Curran (2014) have developed a declarative description and storage of structured annotations, on top of which we have built generic command-line utilities. We describe the most useful utilities (some for quick data exploration, others for high-level corpus management) with reference to comparable UNIX utilities. We suggest that such tools are universally valuable for working with structured corpora; in turn, their utility promotes common storage and distribution formats for annotated text.

