Data and Metadata: An Interview with Murtha Baca and Erin Coburn
You say Ugolino Lorenzetti, I say Bartolommeo Bulgarini
Eve Meltzer, Julia Meltzer, Murtha Baca, and Erin Coburn
As museums, art institutions, and art libraries digitize their collections for greater accessibility, the question of how to categorize and define works of art becomes increasingly important. The Getty Research Institute, a program of the J. Paul Getty Trust, spearheads a program to standardize the vocabularies used to define artworks so that as digital data about collections is created, a common form and language can be used. The challenges are many: What is a “standard” language for defining an art object and how is this determined and agreed upon? Who should comprise the communities who establish these standards? On what criteria do we base standards in a field for which the very notion is already controversial?
Murtha Baca is the head of Digital Resources Management and the Vocabulary Program at the Getty Research Institute in Los Angeles. She oversees the creation of digital resources relating to the collections of the Getty Research Institute. Erin Coburn is the Data Standards Administrator at the J. Paul Getty Museum. Her work focuses on data standards and the creation and use of controlled vocabularies for describing and accessing information on the Getty Museum’s collection, and providing descriptive metadata for the Museum’s collection online. Eve Meltzer and Julia Meltzer met them in September 2003.
Eve Meltzer and Julia Meltzer: What are so-called “vocabularies” and how did the Vocabulary Program begin at the Getty Research Institute?
Murtha Baca: Vocabularies gather all the different ways, right and wrong, of calling things, so that people of different levels of expertise can find things within a collection. The Getty Vocabulary Program, working as a unit with the Getty Standards Program, builds, maintains, and disseminates vocabulary tools for the visual arts and architecture. The vocabularies produced by the Getty include the Art & Architecture Thesaurus® (AAT), the Union List of Artist Names® (ULAN), and the Getty Thesaurus of Geographic Names® (TGN). These resources are intended to aid in the documentation and retrieval of automated information about art, architecture, and material culture.
Can you give us a brief history of museum standards?
MB: More than a decade ago, Categories for the Description of Works of Art, or CDWA, was initiated. The director of what was then called the Art History Information Program at the Getty determined that art museums and other cultural heritage institutions needed a metadata standard the way the library world has MARC. MARC is the main data “container” for bibliographic information in the library world; it stands for MAchine-Readable Cataloging format. MARC defines a data format that emerged from a Library of Congress-led initiative that was begun 30 years ago. It provides the mechanism by which computers exchange, use, and interpret bibliographic information—its data elements make up the foundation of most library catalogues used today. The archival world has a couple of metadata containers that information goes into, but the art museum world didn’t have that. So CDWA was initiated in the late 1980s. As in the library world, this set of data categories was developed through consensus. You get all the people who are the experts in the different fields into the same meetings again and again and again and they debate and debate and debate and they come to a consensus on the necessary categories.
Who are the people who attend those meetings?
MB: The kinds of people who were in those meetings were not just information systems people but also curators—they actually recruited curators who were experts in various areas, such as Asian art, so it wasn’t just folks in the typical fields of western European art. They also recruited people who work with different types of media: curators, librarians, registrars—all the various types who are needed to contribute to an information system. There were, as well, experts from major institutions such as the Getty, the Guggenheim, MoMA, and the Met.
Then out of CDWA came Object ID?
MB: Yes, Object ID and the VRA (Visual Resources Association) Core Categories are two other metadata element sets that are really subsets of CDWA. CDWA is huge: it comprises hundreds of categories. VRA Core is a metadata “container” for the data necessary to catalogue works of art and material culture artifacts; unlike CDWA, VRA Core focuses on visual “surrogates” of works of art (slides, photographs, digital images). Object ID is another metadata element set: it’s only ten elements, and it really looks at the art object as a piece of property that can be stolen or protected or taken across national boundaries. So CDWA cares about the context in which a work was created, that is, the social and historical context: who the patron was, why the object was created, its original location, and so on. Object ID does not care about that. Object ID is used to identify works of art as cultural property. It really cares about what the thing looks like and how you can identify it.
When you say “metadata element set,” what do you mean?
Erin Coburn: The categories of information that you need to describe something—well, not always to describe, but the kind of metadata we’re talking about is descriptive metadata. So, I need to know the title, who created it, how big it is. That is the kind of information you begin with and then you populate those fields with data values from a controlled vocabulary.
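As a rough illustration of the idea of fields populated from a controlled vocabulary, here is a minimal Python sketch. The record layout, the field names, and the tiny hard-coded list of creator names are all assumptions made for this example; they are not taken from CDWA or from any Getty system.

```python
from dataclasses import dataclass

# Hypothetical controlled list of creator names. A real project would draw
# these from an authority such as ULAN rather than from a hard-coded set.
CONTROLLED_CREATORS = {"Titian", "Bulgarini, Bartolommeo"}


@dataclass
class ObjectRecord:
    """A tiny descriptive metadata record: three illustrative elements."""
    title: str
    creator: str
    dimensions_cm: tuple  # (height, width)


def uncontrolled_fields(record):
    """Return the names of fields whose values are not in the controlled list."""
    problems = []
    if record.creator not in CONTROLLED_CREATORS:
        problems.append("creator")
    return problems


# Placeholder records; the titles and dimensions are invented for the example.
ok = ObjectRecord("Example title", "Titian", (100.0, 80.0))
bad = ObjectRecord("Example title", "Tiziano", (100.0, 80.0))

print(uncontrolled_fields(ok))   # []
print(uncontrolled_fields(bad))  # ['creator']
```

The point of the sketch is only the division of labor: the element set says which fields exist, and the vocabulary says which values are acceptable in them.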
Why is it necessary to have standards?
EC: Because data is very labor intensive to produce. In order to have an information system, you have to buy computers, other hardware, and software. All of that can cost a lot of money. But one of the costliest factors often overlooked is the human labor to create and maintain the data. If you do it consistently, it’s easier to migrate when you buy a new system later. It also makes it easier to contribute to consortiums. For example, with a big consortium of art museum information, the only way you can really manage the information efficiently and create meaningful access to it is if everyone agrees upon a shared standard and the same “buckets” of information that they can map the data to. Theft is also a big issue when it comes to justifying the importance of documenting collections. Consider what happened at the Iraqi National Museum in Baghdad. If they had had better documentation, it would now be easier to identify found objects as belonging to the Museum. It’s amazing how many collections aren’t properly documented.
MB: For example, two core metadata elements of Object ID are: 1) ownership—i.e., who does the object belong to—and 2) distinguishing marks. What if you have a whole bunch of little statues of Buddha? Unless you’re an expert, you would not know the difference between them, but if you’ve documented them and noted some distinguishing mark like a scratch, that can also help you identify the object as a form of cultural property.
EC: Standards are also important because people don’t recognize how complex it could be on the back end. You might say that a photograph is from Paris, but there’s more than one Paris in the world. And so, without having some sort of standards in place, “Paris” by itself is meaningless, unless you say, “Paris, France,” or “Paris, Texas.”
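The “Paris” problem can be sketched in a few lines of Python. The place records, their numeric IDs, and the `candidates` helper below are hypothetical and stand in for a real geographic vocabulary; the only point is that a bare name matches more than one record until some context is attached.

```python
# Hypothetical place records. Each one carries an ID and a chain of broader
# places (nearest first), so that identical names can be told apart.
PLACES = [
    {"id": 1, "name": "Paris", "parents": ["France", "Europe"]},
    {"id": 2, "name": "Paris",
     "parents": ["Texas", "United States", "North America"]},
]


def candidates(name):
    """Return a display label for every place that shares the given name."""
    return [
        f'{p["name"]}, {p["parents"][0]}'
        for p in PLACES
        if p["name"] == name
    ]


print(candidates("Paris"))  # ['Paris, France', 'Paris, Texas']
```

A cataloguer who records the ID rather than the bare string has, in effect, already chosen between the two.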
What kind of struggles did you come up against in the creation of standards, specifically in working with curators or administrators and their particular modes of thinking?
MB: That’s a huge issue. We’ve even struggled with the nomenclature itself. In general, people tend to fear the word “standards.” They don’t understand what an “authority file” is, but it just sounds bad. “Controlled vocabulary” sounds really scary, too. No one wants to be controlled. So a lot of it is psychological education. You need to emphasize that standards are good: they’re going to liberate, not imprison. What we say is that the curator can call an object anything he or she chooses. If they want to call it a lekythos or an étagère or a cartonnier—that’s fine. The word desk, for example, might make a curator run out of the room screaming, “That’s not a desk, it’s a cartonnier!” In effect, vocabularies gather all the different ways of designating or naming things, as well as more generic and more specific terms and names, so that people of different levels of expertise can effectively find what they are looking for, or what is there to be found. But the issue is really one of trust. It’s also a big change in terms of practices of administration. And it was a big psychological change. It’s a big change in the way people think about their work. In order to do this kind of work, you have to be on teams with all different sorts of people, from the curator to the guy who scans the images, to the cataloguer who is one of the most important people. So it also presents people with a different way of thinking about their work.
Is the Getty at the forefront of defining these standards?
MB: Yes, the Getty Museum has been at the forefront of producing and controlling information correctly and appropriately, and exposing the public to it. We pioneer in the implementation of standards and controlled vocabularies for art museum information. The Getty Information Institute—some of whose programs, including the Vocabulary Program, are now part of the Getty Research Institute—together with the College Art Association spearheaded the development of CDWA. And we’ve been developing our three vocabularies for 20 years. Nobody else in the art information world really has anything like that. The Library of Congress Subject Headings have been an established authority in the library world for many years, but they’re quite user-unfriendly, and don’t focus on just art and architecture. The Art & Archi tecture Thesaurus (AAT) was begun in 1980, and the Union List of Artist Names (ULAN) began in the late 1980s; they both focus on art and material culture. The Thesaurus of Geographic Names (TGN) focuses on places that are important for art, but it’s worldwide coverage and you can use it for anything. In fact, our vocabularies get licensed by travel agencies and commercial vendors. Again, the way people search on the Internet for now, and for the foreseeable future, is through words. So if the end user is searching for Firenze but I’m calling it Florence in my database, then he won’t find what he’s looking for. So we cluster together all the forms of names associated with a particular person, place, or thing.
Can you tell us about thesauri?
MB: Thesauri are another type of controlled vocabulary in which there are hierarchical relationships. You could have a controlled vocabulary that is just an alphabetical list: a list of everybody who works in your company with their preferred names and other relevant data. That’s a type of controlled vocabulary. A thesaurus has a hierarchical structure. For example, in the TGN, Europe is the continent, then Italy, and underneath those terms are the different regions and provinces of Italy. And it is also very powerful for searching: using the AAT, a user could say, “Go get me all the desks” in a particular collection and he would also retrieve a secrétaire à abattant, even though he had never heard that expression before.
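The kind of hierarchical retrieval described in this answer might be sketched as follows. The three-term hierarchy and the small collection are illustrative assumptions, not the actual AAT structure or any real catalogue; the sketch only shows how a search on a broader term can pick up objects catalogued under a narrower one.

```python
# Illustrative hierarchy only: broader term -> list of narrower terms.
NARROWER = {
    "furniture": ["desks"],
    "desks": ["secrétaire à abattant"],
}


def expand(term):
    """Return the term followed by all of its narrower terms, depth-first."""
    terms = [term]
    for child in NARROWER.get(term, []):
        terms.extend(expand(child))
    return terms


# Hypothetical collection: object label -> the term its cataloguer chose.
COLLECTION = {
    "object A": "desks",
    "object B": "secrétaire à abattant",
    "object C": "chairs",
}


def search(term):
    """Find objects catalogued under the term or any narrower term."""
    wanted = set(expand(term))
    return sorted(obj for obj, t in COLLECTION.items() if t in wanted)


print(search("desks"))  # ['object A', 'object B']
```

The user never has to type “secrétaire à abattant”; the hierarchy supplies it.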
What is controlled about controlled vocabularies?
MB: A controlled vocabulary designates a preferred form. Again, this is a vestige of library language. For example, let’s say I’m doing research on a particular 14th-century Italian painter. I search for “Ugolino Lorenzetti” and I get back a record for “Bartolommeo Bulgarini.” Here’s what is controlled, especially in the library world: the Library of Congress would say that if you’re a cataloguer and you’re cataloguing books about this particular artist, you should use “Bulgarini, Bartolommeo,” spelled exactly that way, and in inverted order. So if you’re cataloguing a book about this artist, even if the title of the book is The World of Ugolino Lorenzetti, the preferred term in the Library of Congress Name Authority File (LCNAF) is “Bulgarini, Bartolommeo.”
This is a vestige of how libraries are physically ordered. I’ve got to go look under the B’s to find this artist. Now in the online environment, all of the different names or spellings are potential access points. So, if a museum prefers to call this artist “Master of the Ovile Madonna,” we don’t care. We don’t say, “Oh no, you must call him Bulgarini, Bartolommeo,” because the controlled vocabulary “knows” that all those name forms refer to the same artist. I picked this one because it’s a dramatic example. Over time, the works by the same person have been assigned dramatically different names. Let me explain. You see, Master of the Ovile Madonna was a designation used for a particular “hand” that had been associated with several paintings, including the so-called Ovile Madonna. Then in the early 20th century, Bernard Berenson, who was a famous critic, basically made up the name Ugolino Lorenzetti for this artist, because he was a follower of the Lorenzetti brothers. So there’s a lot of literature about this artist calling him Ugolino Lorenzetti. Then later, they actually discovered documents showing that the artist’s real name was Bartolommeo Bulgarini. So, what we do is we cluster together all these different forms; we don’t suppress the “wrong” or old forms, and we don’t force people to use our preferred form.
EC: Here’s another good example that demonstrates a controlled vocabulary: let’s search in the ULAN on Tiziano Vecellio, or “Titian.” See the first name? Normally an artist is known by his vernacular name, and Titian’s vernacular name was Tiziano Vecellio. But in all the literature, he is known as “Titian.” So the very first one in the list is “Titian”—it’s “preferred,” and it’s the display name and it’s English preferred. But you will also get many weird spellings; they come from archival documents. At the Getty, we allow an infinite number of variant names because we see them as potential access points.
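The clustering of name forms described in the last two answers can be sketched as a simple lookup table. The record keys (`artist-1`, `artist-2`), the dictionary layout, and the choice of which form to mark as preferred are assumptions for this example (the Bulgarini form is the Library of Congress one quoted above); ULAN's actual data model is richer than this.

```python
# One cluster per artist: a preferred form plus any number of variants.
# The record keys are hypothetical.
CLUSTERS = {
    "artist-1": {
        "preferred": "Bulgarini, Bartolommeo",
        "variants": [
            "Bartolommeo Bulgarini",
            "Ugolino Lorenzetti",
            "Master of the Ovile Madonna",
        ],
    },
    "artist-2": {
        "preferred": "Titian",
        "variants": ["Tiziano Vecellio"],
    },
}


def build_index(clusters):
    """Map every name form, case-insensitively, to its cluster key."""
    index = {}
    for record_id, cluster in clusters.items():
        for name in [cluster["preferred"], *cluster["variants"]]:
            index[name.casefold()] = record_id
    return index


INDEX = build_index(CLUSTERS)


def lookup(name):
    """Return the preferred form for any known name, or None if unknown."""
    record_id = INDEX.get(name.casefold())
    return None if record_id is None else CLUSTERS[record_id]["preferred"]


print(lookup("Ugolino Lorenzetti"))  # Bulgarini, Bartolommeo
print(lookup("tiziano vecellio"))    # Titian
```

Every variant is an access point into the same record, and no form is thrown away; the “preferred” label only decides what is displayed.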
Do you include misspellings?
EC: Good question. We include published misspellings, such as “Georgia O’Keefe” (there should be two f’s), but we don’t include every possible misspelling that a user might use; fuzzy searching algorithms can handle some of that.
How do you get other institutions to adhere to these standards?
MB: That’s the hard part. In the library world, you have to use MARC or you can’t live; you’re just not a library. Libraries also have to use Library of Congress Subject Headings and names. Not only are you not a library if you don’t use MARC or LCSH, you also can’t contribute to the big bibliographic utilities. So if I want to contribute to the RLIN bib file (that is, the Research Libraries Information Network, where you can search all of the major research libraries in North America at the same time), I’ve got to contribute my records in MARC format. They just won’t take it any other way. These standards didn’t exist before in the museum world. Now the task before us is to develop big consortial entities of information, and the struggle arises because data exists in every kind of form you can imagine.
EC: It’s been an uphill battle, but recently, especially because we now have a real-life example to show with the Getty Museum, we can say this is why it pays to use standards, because we can show how the searching works on our own website and on Google. Then people see the benefit in using data standards and controlled vocabularies.
MB: Today the Getty vocabularies are used all over the place, by museums and cultural heritage institutions. People who are totally outside of the art world use the TGN. Between them, the three vocabularies get about 150,000 searches per month on our website. TGN is always in the top two or three web pages that are accessed at the Getty. People can also license the data and then use it in their own systems.
What do you do about coming up with terminology for contemporary art?
MB: It’s not that big a deal. Not too long ago, I was asked by a colleague to help develop a workshop around cataloguing contemporary art and we started having big arguments about it. This person believed that it was a whole different ball game; that you need different metadata elements and different vocabularies for contemporary art. I said I can’t teach this workshop with you because I can’t go to a national conference and say that you need different standards for contemporary art. As I see it, it is simply not true. You’ve got creator information and you’ve got title information whether it’s a painting of the Adoration of the Magi or if it’s a guy nailing himself to a wall in a gallery.
Eve Meltzer is currently a humanities postdoctoral fellow in the Department of Art and Art History at Stanford University, where she teaches contemporary art history and theory. She is working on a book about language and information in art practices of the 1960s and 1970s.
Julia Meltzer is a media artist and executive director of Clockshop, a non-profit media and art organization. She lives in Los Angeles.