Toward a universal thesaurus for shot-level indexing of moving images

James M Turner (EBSI, Université de Montréal)
http://tornade.ere.umontreal.ca/~turner
presentation to the UCLA student chapter of the Special Libraries Association, 2001 02 28

• This work was carried out under the Steven I. Goldspiel Memorial Research Grant for 1999, awarded to James M Turner and Michèle Hudon (both at EBSI, Université de Montréal)
• The title of the project: Organizing moving image collections for the digital era
• General focus: study the organization of thesauri for shot-level indexing of moving images, with a view to finding out if it's reasonable to build a universal thesaurus

Background

• A number of traditional library science methods are in use in the management of moving image collections:
  - constructing a controlled list of descriptors
  - using a book classification
  - building a thesaurus
• But:
  - lists of descriptors prove insufficient once the collection becomes large
  - book classifications adapt poorly to describing images
  - thesauri work best with specialized collections, but news footage and stockshot material are general in nature

How the research proposal came about

• Because there are so many kinds of persons, objects and events to describe in a general collection, it was thought that a general thesaurus would become unmanageable at some point
• However, film and television librarians often create a thesaurus from scratch when building an information system
• Informal talks with stockshot librarians led us to believe that term creation leveled off at some point, possibly around 5000 terms
• So we wanted to study this question formally

Research questions

• What is the point at which term creation levels off?
• How many terms for describing moving images does a thesaurus need to describe a general collection adequately?
• Are the terms similar from one thesaurus to the next, or are collections so particular that an individual tool is required for each collection?
• Would it be reasonable to try to construct a general thesaurus of everyday persons, objects and events that could be shared?

Related research issues

1 Research on literacy: many people get by with a vocabulary of only a few thousand words
• Some figures:
  - Analysis by Sproat of a corpus of 44 million words from Associated Press found over 300 000 distinct words in English (Pinker 1994, 129)
  - Nagy & Anderson found that American students recognize about 45 000 words (Pinker 1994, 150)
  - 900 words account for 90% of the words in everyday spoken English (Dahl 1979)
  - 8000 words account for 90% of the words in everyday written English (Dahl 1979)
  - 4000 words account for 97.5% of texts (all languages) (Guiraud, cited by Deweze 1981, 363)

2 Research on language learning: a core vocabulary is enough to get by in a second language
• in learning French, 5000 to 10 000 words are enough to function reasonably; 30 000 words are enough to master the language (informal sources at UdeM)

Goals and objectives of the study

General goals
• Get an understanding of tools for vocabulary management currently used in North American organizations that manage moving image collections
• See if there are common patterns in the tools and in the methods used for managing vocabulary

Objectives
• to discover how many terms, excluding proper names, are contained in a controlled vocabulary for managing general moving image collections before term creation levels off
• to identify patterns among terms in the existing thesauri created for moving image collections
• to assess how patterns found can contribute to building a shared vocabulary useful for general collections

Methodology

• The directory of the Association of Moving Image Archivists (AMIA) was used to identify potential partners.
• The criteria: institutions that were at least 5 years old, that had general collections, and that used shot-level indexing
• 33 organizations were considered eligible
• A questionnaire concerning the institution, the collections, types of materials held and detailed questions on thesaurus construction was sent out, along with an information kit about the study
• Also, personal contacts were made at the Montréal AMIA conference (November 1999), and a number of phone calls were made to try to recruit partners
• Our research assistant (Yves Devin) followed up with structured interviews at participating organizations

Results

• All in all, 11 organizations holding 14 collections agreed to participate

Most significant findings:
• Use of a thesaurus for managing indexing vocabulary is far from universal
• 6 of the 11 participating organizations used some sort of thesaurus
• Most of these thesauri have some kind of conceptual structure, with descriptors organized using hierarchical and associative relations, and relations of equivalence
• However, few of these had separate lists for proper names, camera angles, emotions, etc.

Quantitative analysis

Most popular formats held:
  Film
    16mm film: 8/14
    35mm film: 8/14
  Video
    3/4" U-matic: 11/14
    Betacam: 13/14

Size of collections (various ways to measure this):
  number of titles: 9/14 could supply (range = 4962 to 100 000 titles)
  viewing hours: 4/14 could supply (range = 750 to 17 848 hours)
  linear feet or meters: 0/14 (this is the easiest way to measure!)

Indexing level (number of collections):
  title: 11/14
  sequence: 5/14
  shot: 8/14
  all three levels: 5/14
  other levels: 5/14 (e.g. reel or cassette)

Types of indexing tools:
  Key words: 7/14
  Classification: 3/14
  Thesaurus: 6/14
  Subject headings: 6/14
  Other: 2/14

Number of terms assigned per shot (8 collections supplied data):
  5 or fewer: 4/8
  6–10: 1/8
  10–15: 3/8
  no maximum: 2/8
  average: about 12 terms

Updating the controlled vocabulary (6 collections supplied data):
  As needed: 3/6
  Daily: 1/6
  Weekly: 1/6
  Irregularly: 1/6

Lexical analysis

• We wanted to study how much overlap there was among the vocabulary found in the various tools
• We were able to get our hands on 7 of the indexing tools
• We analyzed the contents for the letters F, I and R
  - first, we eliminated letters of the alphabet with which fewer than 900 dictionary words begin
  - we also eliminated letters with which more than 5000 words begin
  - of the 15 mid-range letters remaining, we randomly selected 3 (F, I and R)
• First, some elements included under these letters were eliminated from the analysis:
  - numbers
  - proper names of persons and organizations
  - titles of books, songs, films, etc.
• Next, the remaining terms were combined into a single list
  - this contained 2292 distinct terms
  - of these, 1858 (81%) represent concrete entities
  - 434 (19%) represent abstract notions

Of these 2292 terms in the 7 indexing tools we could review:
  1680 terms (73% of the terms) appeared in 1 thesaurus
  338 terms (15% of the terms) appeared in 2 thesauri
  134 terms (6% of the terms) appeared in 3 thesauri
  72 terms (3% of the terms) appeared in 4 thesauri
  47 terms (2% of the terms) appeared in 5 thesauri
  14 terms (0.6% of the terms) appeared in 6 thesauri
  7 terms (0.3% of the terms) appeared in 7 thesauri
So 140 terms (about 6%) appeared in 4 or more thesauri

• These figures are rather surprising, considering the similarity of the material in the collections these tools represent
• They weaken our hypotheses to the effect that
  - a limited number of terms are enough to represent a collection
  - overlap among the terms would suggest building a common thesaurus
• However, further analysis of this data may show that the situation is not so bad
  - sorting out synonyms would probably show closer coordination
  - grouping terms under the concepts represented (semantic networks) would probably tighten the figures
  - additional grouping could be done of identical notions represented at different hierarchical levels among the 7 indexing languages

Observations

• Public institutions have developed better tools for vocabulary management than private institutions have
• However, the quality of public sector tools is deteriorating because of
  - budget cuts
  - privatization
  - the need to justify their existence and show short-term profit
• Private institutions don't consider good information management a priority, and don't seem to understand the need to invest in good information systems
  - immediate profit is all that counts
• One theory: constant improvements to information technology (faster, cheaper computers, cheap digital storage) may be what allows moving image collection managers to get away with sloppy indexing methods
• Our best data is from public institutions in our own back yard. This may also explain why we thought we could do this study
• It was much easier to find partners willing to participate in the public sector institutions
• Private institutions tend to be secretive; however, all partners we recruited were very open and very helpful
• It was a problem for both public and private sector partners to allocate resources to compile data for us because of time and production constraints

Conclusions

• Shot-level organization of moving image collections in North America is pretty much a free-for-all
• Ad hoc systems prevail, and the situation is likely to stay like this in the private sector
• Although we couldn't show empirically that there are solid foundations for building a universal, multilingual thesaurus for managing moving image collections, the idea still seems to be a good possibility
• We are pursuing this idea in the context of follow-up research projects
  - collaboration with other researchers on building an English-language thesaurus for general collections of images
  - studying the possibility of semi-automated construction of bilingual thesauri
  - studying the possibility of constructing multilingual thesauri
• European archives seemed better organized, but we are not so sure now

Acknowledgments

This work was carried out under the Steven I. Goldspiel Memorial Research Grant for 1999, awarded by the Special Libraries Association, Washington
• Michèle Hudon, professor at EBSI, Université de Montréal, was co-investigator
• Yves Devin was research assistant for the project
• Andra Darlington organized this talk
• Some institutions that participated in the study:
  - Montreal: National Film Board of Canada
  - Toronto: Canadian Broadcasting Corporation, vTape
  - Atlanta: CNN
  - San Francisco: The Media Archive, Oddball
  - Los Angeles: UCLA Film & Television Archive, Warner Bros., Channel One
  - Boston: WGBH

References

Dahl, H. 1979. Word frequencies for spoken American English. Detroit: Gale Research Company.

Deweze, A. 1981. Réseaux sémantiques : essai de modélisation - application à l'indexation et à la recherche documentaire. Lyon? : Université Claude-Bernard.

Pinker, S. 1994. The language instinct. New York: William Morrow & Company.

Some publications resulting from this project

Hudon, Michèle, James M. Turner and Yves Devin. 2000. How many terms are enough? Stability and dynamism in vocabulary management for moving image collections. Proceedings of the 6th International ISKO Congress, 10–13 July 2000, Toronto, Canada, edited by Clare Beghtol, Lynne C. Howarth, Nancy J. Williamson. Würzburg: Ergon, 333–338.

Turner, James M., Michèle Hudon and Yves Devin. 2000. Text as a tool for organizing moving image collections. Proceedings of the 28th conference of the Canadian Association for Information Science, Edmonton, 2000 05 28. Available at http://www.slis.ualberta.ca/cais2000/turner.htm

Turner, James M. 2001. Not quite back to the drawing board: a reasonably successful failure. 10th annual AMIA conference, Los Angeles, November 2000. [not yet available online]
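
Note on the overlap tally

The overlap figures reported under Lexical analysis (how many distinct terms appear in exactly 1, 2, ... 7 indexing tools) could be tallied along the following lines. This is only a minimal sketch, not the procedure actually used in the study: the function name overlap_distribution, the toy term lists, and the normalization (trimming spaces and lowercasing) are all assumptions made for illustration. The second half simply re-adds the counts reported in the talk.

```python
from collections import Counter


def overlap_distribution(term_lists):
    """Return a Counter mapping k -> number of distinct terms found in exactly k lists.

    Hypothetical helper: normalization here is only strip + lowercase,
    which is an assumption, not the study's documented method.
    """
    appearances = Counter()
    for terms in term_lists:
        # Use a set so a term repeated inside one tool is counted once for that tool.
        for term in {t.strip().lower() for t in terms}:
            appearances[term] += 1
    return Counter(appearances.values())


# Toy, made-up term lists standing in for three indexing tools.
tools = [
    ["factory", "fire", "river"],
    ["Fire", "island", "river"],
    ["fire", "rain"],
]
dist = overlap_distribution(tools)

# Counts reported in the talk: number of terms appearing in exactly k thesauri.
reported = {1: 1680, 2: 338, 3: 134, 4: 72, 5: 47, 6: 14, 7: 7}
total = sum(reported.values())
shared_4_plus = sum(n for k, n in reported.items() if k >= 4)
share_percent = 100 * shared_4_plus / total

print(dict(dist))
print(total, shared_4_plus, round(share_percent, 1))
```

Sorting out synonyms or grouping terms under concepts, as suggested in the talk, would amount to replacing the simple normalization step with a mapping from each term to its preferred form before counting.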