Friday, October 14, 2011

Challenges of Mining Case Citations from Law Review Articles

The previous post was written two weeks ago. Since then, I've been working primarily with Paul Deschner of the HLSL Innovation Lab to design an algorithm that can mine cases from a full text law review database. There are some interesting challenges in doing this.

We've been able to acquire test files with which to perform test searches, and here are some of the interesting challenges that we've come up against. First off, we've discovered that the uniform system of citation to case law, which we take for granted today, did not exist prior to its widespread adoption in the mid-1930s. Case names did not consistently use "v." or "vs." between party names. In fact, I've found many footnotes where both forms were used in the same footnote! Designing an algorithm to capture all cases depends on knowing all forms of case citations. For early law reviews, this may present some tricky challenges. In addition to case names with "v." or "vs.", we also need to account for other case names, such as "in re".
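As a rough illustration of the matching problem, here is a minimal sketch in Python. It is not the algorithm we are building at the Lab; the pattern names (WORD, PARTY, CASE_NAME) and the function find_case_names are made up for this example. It assumes party names are runs of capitalized words and handles only the "v.", "vs." and "In re" forms mentioned above.

```python
import re

# One capitalized word; periods, ampersands, apostrophes and hyphens are
# allowed inside it so that abbreviations such as "Co." or "R.R." survive.
WORD = r"[A-Z][\w.&'-]*"
# A party name: capitalized words, optionally joined by "of" or "the".
PARTY = rf"{WORD}(?:\s+(?:(?:of|the)\s+)*{WORD})*"
# Either "Party v. Party" / "Party vs. Party", or "In re Party".
CASE_NAME = re.compile(rf"\b(?:{PARTY}\s+vs?\.\s+{PARTY}|[Ii]n\s+re\s+{PARTY})")


def find_case_names(text):
    """Return every substring of text that looks like a case name."""
    return [m.group(0) for m in CASE_NAME.finditer(text)]


sample = (
    "The court in Marbury v. Madison, 1 Cranch 137, went further than "
    "Smith vs. Jones, 12 Mass. 34; see also In re Gault, 387 U.S. 1."
)
print(find_case_names(sample))
```

A pattern this simple will stumble on real footnotes. A capitalized word directly in front of the first party ("See Smith v. Jones") is swallowed into the name, and a name at the end of a sentence keeps its trailing period. Early law reviews will surely turn up forms it has never seen.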

But because we're trying not only to find case citations but also to use the citations to link to the cases themselves, we also need to account for where the case's citation actually falls within the writing. For example, it's fairly common for cases discussed in an article to be mentioned by name in the text of the article, with the citation to the case in the footnote. Linking the name with the citation, so that the case itself can be used in some way, poses a challenge.

Another challenge is the common use of short form citations to cases. We hope to design our database so that it will rank each case by the number of times that it's been referred to in each article and in other articles. The use of short form names means that when we identify a citation, we'll need to design an algorithm that can identify later references to the case, even when the full name is not used. This can be tricky when the short form is a common name or word, or when it isn't composed of a party name at all.
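To make the short form problem concrete, here is a small Python sketch. It assumes, purely for illustration, that the candidate short forms are the party names on either side of "v." or "vs."; the function names short_forms and count_references are hypothetical. It reports each count separately rather than adding them up, because some of those matches will not be references to the case at all.

```python
import re


def short_forms(case_name):
    """Split a full case name into candidate short forms (its party names)."""
    parties = re.split(r"\s+vs?\.\s+", case_name)
    return [p.strip() for p in parties if p.strip()]


def count_references(case_name, text):
    """Count full-name mentions, then short-form mentions in what remains."""
    counts = {"full": text.count(case_name)}
    # Blank out the full name so its parts are not counted a second time.
    remainder = text.replace(case_name, " ")
    for form in short_forms(case_name):
        if form != case_name:
            pattern = rf"\b{re.escape(form)}\b"
            counts[form] = len(re.findall(pattern, remainder))
    return counts


article = (
    "In Miranda v. Arizona, the Court set out the warnings. "
    "Under Miranda, silence is protected. "
    "The Miranda rule also applies in Arizona."
)
print(count_references("Miranda v. Arizona", article))
```

The "Arizona" count in this sample is a mention of the state, not the case, which is exactly the trouble with short forms that are common names or words.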

I've also begun to notice two additional factors that present challenges. First, it was fairly common to include tables of cases in law reviews and journals. This could be very helpful in designing our algorithm, or it can present a large challenge. Tables of cases are given a variety of titles in the journals, such as Table of Cases Discussed or simply Tables of Cases. Sometimes the cases are even listed in indexes that were once regularly published for each volume. When cases are listed in indexes, they are sometimes listed by jurisdiction or court, making the task of identifying the cases cited in articles tricky, since we're looking for the number of times cases are discussed as well as whether they're cited at all.

Another interesting challenge that I hadn't anticipated is the practice among many law reviews and journals (what's the difference between a law review and a law journal, anyway?) of running a regular feature usually called something like a "Survey of Recent or Notable" cases. These surveys usually discuss cases in very brief form and amount to abstracts of cases that the law review editors feel are noteworthy for one reason or another. It seems that cases merely mentioned in regular surveys of recent cases don't qualify as "leading" cases. Therefore, references to these cases should probably be discounted.

A New Mode of Full-text Case Retrieval - a work in progress

[This past academic year, John Palfrey, Professor of Law, Vice Dean, Library and Information Resources, Faculty Co-Director, Berkman Center For Internet and Society at Harvard Law School, was intrigued by an idea that I’ve been kicking around for several years and invited me to come to Harvard to work on it.

With the support of an incredibly talented staff in my home library, I felt comfortable taking a semester off. And so I am visiting as an Academic Fellow at Harvard Law School Library’s Innovation Lab for the 2011 fall semester.]

Designing a Solution to a Problem
The project that I’ve been asked to explore has to do with the inherent challenges of conducting case law research using full text online databases. The working title of the project is “Leading Case Service,” and it is designed to make online case law research more productive and more efficient. There are three factors that make online case law research very difficult.

First is the size of the database. It is estimated that there are approximately ten million published cases in the American legal system. The size of the database alone poses very serious difficulties for designers of search engines and indexing systems, both digital and analog.

The size of the corpus of case law in the American legal system isn’t merely the result of our society’s litigious nature. Prior to the mid-nineteenth century, the publication of cases was done very judiciously. Most cases were published in selective case reporters that only published leading cases. In fact, the most influential American case reporter in the nineteenth century was the predecessor to what we know today as American Law Reports, or ALR, and it only published cases of some particular significance, either because the opinion made a ruling on a novel aspect of the law or clarified an issue that had been dealt with by many courts with varying outcomes. In the late nineteenth century, the West Publishing Company entered the case law publishing market and effectively turned cases into a commodity. The method by which West published cases was virtually indiscriminate because it published any and all cases submitted to it by the courts. Its business model was founded on the premise that the more cases it could publish, the better; the more cases it could publish, the more volumes it could sell.

As the volume of cases it published grew, West developed an elaborate subject indexing system to help researchers. We know the indexing system as the Key Number System, and the index as the West Digest System. Today, the index alone numbers several thousand volumes! Coupled with Shepard’s citations, the digest and case-verification systems helped researchers identify both cases that were useful and those that were “still good law,” in the sense that they hadn’t been specifically overruled by another court. This system was extremely accurate, thorough, and objective but still left the researcher with a serious problem of having to wade through a substantial mass of material. The comprehensiveness of the West National Reporter System, its Digests and Shepard’s meant that the cases discovered on any one particular topic could number in the thousands.

The enormous volume of case law poses difficulties for researchers for another reason. Important research in the field of information science explains that, due to the vagaries of language and other empirical laws of linguistics, full text database searching is by definition inefficient, even in databases filled with documents of a professional nature and highly specialized vocabulary, such as law. Studies have shown that the best a full text search engine is capable of retrieving amounts to only about twenty percent of the relevant documents on a topic. With 10 million published cases, even a 20% efficiency yields far more cases than any person can reasonably be expected to read.

Second, full text databases are objective search tools. This makes full text case law databases very difficult places for researchers to go to find answers about the law. For instance, let’s say you want to know what the law is on the rights of grandparents to intervene in custody proceedings in dissolution cases. A search for cases on this topic, if done with absolute precision, may yield dozens, if not hundreds, of cases, not what you really want or need. In this instance, a more useful approach would be to consult secondary sources, such as handbooks or treatises, that not only discuss the leading cases in the field but also summarize and analyze what these cases mean to the practitioner. Full text case law databases themselves are only part of what researchers need to complete their research.

Third, in order for online databases to be efficiently used, each document, as well as its sections, parts, words, letters, etc., should be indexed and tagged with what’s known in the computing world as meta-data. Indexing on this scale is massive and extremely complex but can make the development of search engines designed to work with these huge databases much more efficient. This is why Westlaw’s and Lexis’s search engines are so useful. Each company runs the full text of each case contained in its databases through extensive indexing and tagging. Part of this process eliminates repetitive words that have no legal meaning, such as articles, conjunctions, etc. Further, the content of the cases is divided into sections, such as majority and minority opinions, jurisdictions, etc., which dramatically helps narrow the search results. Indexing and tagging on this scale is a very costly venture, leaving only Lexis and Westlaw dominating the field. The process is also so complex that each company’s processes are highly guarded trade secrets. The Lexis and Westlaw case law databases are composed entirely of public domain materials, but they are still extremely expensive to use. Each company cites the high cost of thorough indexing, tagging and sorting as a rationale to charge high prices for access.

A Solution to a Problem
The “Leading Case Service” may be a means of leveling the playing field for newcomers to the online database market, or for existing services that offer access to case law for free. The theory behind the project is that among the ten million cases in the American legal system, there is a relatively small percentage of cases that are considered more significant and interesting than the rest. If these cases can be identified and a search tool developed to exploit them, it may make searching case law more efficient by helping researchers focus on the most important cases first, before moving into the vast body of case law to find newer cases, cases with variant facts or those more specific to a particular jurisdiction.

The first step is to determine if there is such a thing as a group of “leading cases” and, if there is, to figure out how to find them and use them. The theory at present is that this group of cases can be found in the body of secondary materials. Initially, I thought that we could find leading cases in the footnotes and body of treatises, presuming that treatise writers would discuss or cite to only the most important cases in their fields. To gather this group of leading cases, we could “mine” treatises and discover the cases cited in them. However, there are significant challenges to mining treatises for the cases the authors have cited, not the least of which is that treatises are published in many different formats and by enough publishers to make it difficult to use a single system to acquire the desired information. I’ve been convinced to put this step on hold, at least for the moment.

Our thinking at present is that we may have better luck focusing our efforts on cases cited in law review articles. There are two reasons that we think that law review articles may be better sources with which to discover this body of leading cases. First, we presume that the writers of law review articles, as experts in their fields, are vigilant in identifying important cases in those fields and that, overall, these scholars will discuss all the important cases in American law. I realize that this is a strong presumption, but over the last century, virtually all significant developments in law have been discussed and debated at length in law reviews and law journals. It follows, then, that the cases cited by the writers should be the ones most important or significant for one reason or another and can be identified as “leading cases,” those that researchers should read or at least be aware of when researching case law in that field.

A second advantage of using law review articles to identify leading cases is that the body of scholarship is continually expanding. If my theory is correct and we can discover this body of case law, we may be able to create an automated process that will continually add to the corpus of leading cases.

There are many questions to be answered. The most interesting question, and the one that I’ll be spending my time exploring initially, is: exactly how many cases are cited by law review articles?

We know that there are around ten million cases published in total, but we don’t know what percentage of those cases found their way into the footnotes and text of law review articles. We are very close to obtaining the tools to answer this question conclusively. Hunches about the percentage of cases discussed in law reviews range from 5% down to 1% or less, that is, from 500,000 cases down to 100,000 or fewer. If this is true, then full text case law database searching should be greatly improved by the mere fact that the researcher would be searching in a database of two or three hundred thousand cases instead of ten million!
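The arithmetic behind those hunches is simple enough to check. The short Python sketch below assumes the round figure of ten million published cases used throughout this post; the function name cases_for_share is only for illustration.

```python
TOTAL_CASES = 10_000_000  # rough estimate of published American cases


def cases_for_share(total, percent):
    """Number of cases represented by a whole-number percentage of the total."""
    return total * percent // 100


for percent in (5, 1):
    print(f"{percent}% of {TOTAL_CASES:,} cases = {cases_for_share(TOTAL_CASES, percent):,}")
```

Integer arithmetic is used so the figures come out exact.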

Assuming that our initial tests reveal that there is, indeed, a body of leading cases that we can identify, many interesting possibilities emerge. The cases themselves may be ranked based upon the numbers of law reviews or journals that have cited them. (This is sort of a twist on Shepard’s service for law review articles. Instead of Shepardizing articles to find cases that cite to the articles, we’re looking for cases cited in the articles themselves.) Other information that may help rank the value of the cases includes the standing of the journal itself in which the article is published, or the reputation, publishing record or school of residence of the author.
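Here is one way such a ranking might be sketched in Python, offered as a thought experiment rather than a design. It assumes the mining step has already produced (article, journal, case) records, counts each article at most once per case, and optionally applies a per-journal weight. The function name rank_cases, the record layout and the weights are all hypothetical.

```python
from collections import defaultdict


def rank_cases(citations, journal_weight=None):
    """Rank cases by the (optionally weighted) number of articles citing them.

    citations: iterable of (article_id, journal, case_name) tuples.
    journal_weight: optional dict mapping a journal to a weight (default 1.0).
    """
    journal_weight = journal_weight or {}
    seen = set()
    scores = defaultdict(float)
    for article_id, journal, case in citations:
        if (article_id, case) in seen:
            continue  # an article counts once per case, however often it cites it
        seen.add((article_id, case))
        scores[case] += journal_weight.get(journal, 1.0)
    # Highest score first; ties are broken alphabetically by case name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


records = [
    ("article-1", "Journal A", "Roe v. Wade"),
    ("article-1", "Journal A", "Roe v. Wade"),
    ("article-2", "Journal B", "Roe v. Wade"),
    ("article-2", "Journal B", "Erie v. Tompkins"),
]
print(rank_cases(records))
print(rank_cases(records, journal_weight={"Journal B": 2.0}))
```

The same weighting hook could give a weight below 1.0 to mentions that come from a survey of recent cases, if those references ought to be discounted.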

Even if we are successful in identifying this corpus of leading cases, we have yet to determine how they should be used. The options are to create a separate database or to use meta-data to tag, or identify, the cases so that search engines will be able to identify the leading cases from among the rest of the millions of cases in the corpus of American law. Depending on the tags used, the researcher can use this information to sort search results in interesting and valuable ways. For example, a researcher desiring to know what the law is in an area novel to him could begin with a full text case law database and immediately identify the most important cases in the field. After perusing these cases, links and metadata could then be used to immediately find articles, blogs and other pertinent online materials.
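A minimal sketch of the tagging option, in Python, might look like the following. It assumes a search engine hands back case names in its own order and that the leading-case tag is simply a count of citing articles; sort_results and the sample counts are invented for illustration.

```python
def sort_results(results, citing_counts):
    """Move tagged leading cases to the front of a result list.

    results: case names in the order the search engine returned them.
    citing_counts: dict mapping a leading case to its number of citing articles.
    Untagged cases keep their original relative order (sorted() is stable).
    """
    return sorted(results, key=lambda name: -citing_counts.get(name, 0))


engine_order = ["Doe v. Roe", "Erie v. Tompkins", "Smith v. Jones", "Roe v. Wade"]
tags = {"Roe v. Wade": 40, "Erie v. Tompkins": 25}  # made-up counts
print(sort_results(engine_order, tags))
```

Because the tag lives outside the case text, the same list could just as easily be sorted by journal standing or any other piece of metadata.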

The goal of the project is to create a new way of using online digital legal materials. New technologies have allowed us to think of combining information in ways that were unheard of, even unthinkable, before today. “Leading Case Service” is essentially a ‘mash-up’ of online case law databases and online databases of law review articles. To this mash-up, colleagues have suggested that we may be able to add blogs, digital commons, wire-services, websites, legal periodical indexes and possibly treatises. The use of this information is not merely academic. It may also prove to be a way to power new search engines or discover new ways that various parts of the conceptual, scholarly world of the law influence each other.

Friday, January 14, 2011

Waiting for the Other Shoe to Drop

I'm baffled by publishers' arrogance these days. Two recent events made me whack my head with the palm of my hand….

Law Journal Seminars Press is now rolling out a "fantastic" new program for their books. Instead of merely paying for the looseleaf supplements for their books (for the most part reasonably priced, by the way), we can now either opt to receive them in print and online, or online only. Print and online, of course, costs more than the print supplements alone. Online only costs about the same.

I'm in an academic law library and online has absolutely no interest for me - or my patrons. Apparently, each title would have to have a separate login. So, if I did opt for either option, I'd need to keep track of the various passwords for each title. Good grief. I can't imagine a more inconvenient process.

We're canceling all Law Journal Seminars Press titles.

The other situation is even more annoying. And it always has been. We subscribe to the Economist ($138/year) and route it among the faculty and put the routed one in the faculty lounge. They've got a pretty nice online service with email alerts, etc., and we looked into getting an online subscription. After six months, they finally got back to us with a "fantastic" deal: $1500 per year for online access for our library. Are they mad? Do they really think that only one person reads our single subscription to the print version?

Come to think of it, I wonder why they don't charge volume rates for the print version anyway?

Actually, I'm pretty sure that that's coming….

Tuesday, January 04, 2011

Where are the Catalogers? Proposed Amendment to the Durham Statement

Reflecting on the character of the Durham Statement
As the scholarship becomes more widely available in digital formats, it is critical that we seek input from catalogers and technical services librarians on how to make these digital resources as useful and usable as possible.
I've been thinking about the meaning of the Durham Statement to legal researchers and legal bibliographers. It has occurred to me that a very important stake-holder/contributor is missing from the statement. The Statement involves aspirations that reflect observations and presuppositions about how we feel the future will affect the publication of modern legal scholarship. The observations and assumptions are well and good, and at least partially true. (See my post, “Why I'm Signing the Durham Statement,” 2/6/2010)
The Statement accurately reflects how technology is speeding the process of the digitization of legal scholarship along and, therefore, calls on law schools to immediately and expeditiously cease publication of their journals in print in favor of digital formats. It presumes, of course, that technology is developing in such a way that researchers will prefer to access the information digitally, and not only that the technology either exists or will exist, but that it will be presented in a form that researchers can use. The two parties present in this scenario are researchers and technicians.

The missing party
The best technology available, in terms of readers, websites, formats, etc., and the scholarship it contains is only as good (to a large and very practical extent) as the form of the content itself. One of the advantages of print media was that its very "artifactness" seemed to demand that it be curated properly. That is, each item received into a library should be thoroughly analyzed and objectively described in order to facilitate discovery and access, and, hence, usage. Part of the nature of analog scholarship is its very permanence: once printed, it's memorialized; it becomes an artifact. It is, thus, capable of objective description. This description, seen by cynics as bibliographic hypertrophy, is what we know as MARC, LC Classification and LC Subject Heading description. These descriptions and analyses made collection of, and access to, library materials standard. Despite our cynicism, the system as a whole served/serves us well. RLIN and OCLC and the host of OPAC and serial automation vendors made access to collections remarkably easy, especially when compared to the alternatives. (Librarians simply storing materials as they see fit based on their own understanding of a subject, for instance.)
It is time to subject digitally produced, born digital scholarship to the same rigorous analysis. It is almost certainly the case that the old wineskins won't meet the needs of modern libraries. MARC record format, Library of Congress Classification and Subject Headings would likely all need to be modified or revised to serve the needs of modern formats that aren’t physical or don’t possess the physical character of printed law review articles, for example. This new analysis would need to take advantage of things like metadata and hypertext links and would be less concerned, of course, with organizing the materials themselves, but could provide important tools to allow others to organize, use and access them.
Whatever the exact format, form or nature, it is clear that production and distribution of born digital scholarship will benefit from systematic, standard analysis and bibliographic description. If each article of a born digital law review, journal or scholarly blog were subjected to standard bibliographic description and analysis, it could serve the user in many ways. First, it may facilitate the development of better search engines that mine this important form of legal scholarship. Second, it may also facilitate the creation of better, more secure storage formats. It would also bring thoughtful vigor to the process of digitization and make stability of formats not merely useful, but desirable to librarians and technicians.
The alternative of doing nothing and letting technology take care of itself results in relying on Google as a search engine. Google’s fine to an extent, but its performance as a search engine is not consistent or reliable. Its quickness and ubiquity make it an easy thing to rely upon and use despite its limitations. With something so cheap and easy, it’s very easy to overlook its shortcomings. But as the volume of digital scholarship increases, Google’s limitations may become more and more apparent, and it may be harder for users to shrug off the annoyance of Google’s inherently sloppy indexing.

It's an easy conclusion that we must bring Technical Services to the table as we endorse (more or less) the migration from print to digital formats.

The Proposal
I don’t have specific language in mind for how the Durham Statement should be amended or supplemented. It would be something to the effect that the signatories commit to involving TS departments and experts in the process of digitization of their law reviews.