The National Archives of Finland has carried out a format study within our co:op project. Its primary purpose is to review and analyze the file formats available for storing automatically recognized or manually entered text (transcription). Automatic recognition can be either OCR-based (i.e. recognition of printed text) or HTR-based (i.e. recognition of handwritten text).
The existing file formats are described in terms of their structure and special characteristics, and links to schema files or more detailed descriptions of the formats are given. The study also attempts to list some of the projects, organizations and pieces of software that use the formats. Finally, a summary and comparison of the reviewed file formats is provided.
Another purpose of this format study is to analyze the applicability of the file formats in the environment of the National Archives of Finland. This requires a state-of-the-art analysis that identifies the current systems related to, e.g., long-term preservation of documents, metadata handling and information search, and that describes the changes foreseen in the environment in the near future. In addition, requirements concerning the types of usage potentially enabled by the availability of OCR- or HTR-processed document text are listed. Finally, the potential implications of fulfilling the listed requirements for processes, other systems and processing are analyzed.