Friday, October 14, 2011

Challenges of Mining Case Citations from Law Review Articles

The previous post was written two weeks ago. Since then, I've been working primarily with Paul Deschner of the HLSL Innovation Lab to design an algorithm that can mine cases from a full text law review database. There are some interesting challenges in doing this.

We've been able to acquire test files with which to perform test searches, and here are some of the interesting challenges that we've come up against. First off, we've discovered that the uniform system of citation to case law, which we take for granted today, did not exist prior to its widespread adoption in the mid-1930s. Case names did not consistently use "v." or "vs." between party names. In fact, I've found many footnotes where both forms were used in the same footnote! Designing an algorithm to capture all cases depends on knowing all forms of case citations. For early law reviews, this may present some tricky challenges. In addition to case names with "v." or "vs." we also need to account for other forms of case names, such as "in re".
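
To make the problem concrete, here is a minimal Python sketch of a first-pass case name matcher. It is only an illustration under stated assumptions, not the algorithm being designed for the project: the names NAME, PARTY, CASE_NAME_RE, find_case_names and normalize_case_name are all hypothetical, and the pattern assumes that a party name is a run of capitalized words. It accepts both "v." and "vs." as well as "in re", and it deliberately ignores abbreviations such as "Co." that a real matcher would have to handle.

```python
import re

# Assumption: a party name is a run of capitalized words, optionally joined by
# "of", "and" or "the". Abbreviations with periods ("Co.", "R.") are NOT handled.
NAME = r"[A-Z][A-Za-z'&-]*"
PARTY = rf"{NAME}(?:\s+(?:(?:of|and|the)\s+)*{NAME})*"

CASE_NAME_RE = re.compile(
    rf"\b(?:{PARTY}\s+vs?\.\s+{PARTY}"   # "Smith v. Jones" or "Smith vs. Jones"
    rf"|[Ii]n\s+re\s+{PARTY})"           # "In re Gault" or "in re Gault"
)


def find_case_names(text):
    """Return every substring of text that looks like a case name, in order."""
    return [match.group(0) for match in CASE_NAME_RE.finditer(text)]


def normalize_case_name(name):
    """Map variant forms ("vs.", lowercase "in re") onto a single spelling."""
    name = re.sub(r"\bvs\.", "v.", name)
    name = re.sub(r"^in\s+re\b", "In re", name, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", name).strip()


if __name__ == "__main__":
    sample = ("The rule in Smith v. Jones was later questioned in "
              "Brown vs. Board of Education; see also in re Gault.")
    for found in find_case_names(sample):
        print(normalize_case_name(found))
```

Normalizing "vs." to "v." means that a footnote using both forms for the same case would still be counted as one case. A capitalized word directly in front of a party name (for example "See Smith v. Jones") would be swallowed into the match, which is one reason a pattern this simple is only a starting point.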

But because we're trying not only to find case citations but also to use the citations to link to the cases themselves, we also need to account for where the case's citation actually falls within the writing. For example, it's fairly common for cases discussed in an article to be mentioned by name in the text of the article, with the citation to the case in the footnote. Linking the name with the citation, so that the case itself can be used in some way, poses a challenge.

Another challenge is the common use of short form citations to cases. We hope to design our database so that it will rank each case by the number of times that it's been referred to in each article and in other articles. The use of short form names means that when we identify a citation, we'll need to design an algorithm that can identify later references to the case, even when the full name is not used. This can be tricky when the short form is a common name or word, or when it isn't composed of a party name at all.
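
The short form problem can be sketched in a few lines of Python. This is a simplified illustration, not the project's method: short_forms and count_references are hypothetical helpers, and the sketch assumes that a short form is simply one of the party names, which the paragraph above notes is not always true.

```python
import re


def short_forms(full_name):
    """Return the party names of an adversarial case as candidate short forms."""
    parties = re.split(r"\s+vs?\.\s+", full_name)
    return [party.strip() for party in parties if party.strip()]


def count_references(text, full_name):
    """Count full-name mentions plus short-form mentions after the first full one."""
    full_hits = text.count(full_name)
    if full_hits == 0:
        return 0
    # Only look for short forms after the case has been introduced in full,
    # and blank out later full names so their parties are not counted twice.
    start = text.find(full_name) + len(full_name)
    rest = text[start:].replace(full_name, " ")
    forms = short_forms(full_name)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b")
    return full_hits + len(pattern.findall(rest))


if __name__ == "__main__":
    article = ("In Miranda v. Arizona the Court set out warnings. "
               "Under Miranda, police must advise suspects. "
               "The Miranda rule has limits.")
    print(count_references(article, "Miranda v. Arizona"))
```

The weakness is easy to see: for a case such as "Smith v. Jones", a later sentence about an unrelated Professor Smith would be counted as a reference to the case, and a passing mention of the state of Arizona would inflate the count in the example above.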

I've also begun to notice two additional factors that present challenges. First, it was fairly common to include tables of cases in law reviews and journals. This could be very helpful in designing our algorithm, or it can present a large challenge. Tables of cases are given a variety of titles in the journals, such as Table of Cases Discussed or simply Tables of Cases. Sometimes the cases are even listed in indexes that were once regularly published for each volume. When cases are listed in indexes, they are sometimes listed by jurisdiction or court, making the task of identifying the cases cited in articles tricky, since we're looking for the number of times cases are discussed as well as whether they're cited at all.

Another interesting challenge that I hadn't anticipated is the practice among many law reviews and journals (what's the difference between a law review and a law journal, anyway?) of running a regular feature usually called something like a "Survey of Recent Cases" or "Survey of Notable Cases." These surveys usually discuss cases in very brief form and amount to abstracts of cases that the law review editors feel are noteworthy for one reason or another. It seems that cases merely mentioned in regular surveys of recent cases don't qualify as "leading" cases. Therefore, references to these cases should probably be discounted.
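
One way to discount survey mentions, rather than throw them away, is to give each mention a weight that depends on where it appears. The Python sketch below assumes that every mention has already been labeled with the kind of section it came from; the labels, the weight values in SECTION_WEIGHTS and the function name weighted_scores are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical weights: a mention in a full article counts fully,
# a mention in a recent-case survey counts for a quarter as much.
SECTION_WEIGHTS = {"article": 1.0, "survey": 0.25}


def weighted_scores(mentions):
    """Sum section weights per case; mentions is a list of (case, section) pairs."""
    scores = defaultdict(float)
    for case, section in mentions:
        # Unlabeled or unknown sections are treated like ordinary articles.
        scores[case] += SECTION_WEIGHTS.get(section, 1.0)
    return dict(scores)


if __name__ == "__main__":
    mentions = [
        ("Smith v. Jones", "article"),
        ("Smith v. Jones", "article"),
        ("Doe v. Roe", "survey"),
        ("Doe v. Roe", "survey"),
    ]
    print(weighted_scores(mentions))
```

Setting the survey weight to zero would drop those mentions entirely, so the same structure covers both a soft discount and an outright exclusion.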

A New Mode of Full-text Case Retrieval - a work in progress

[This past academic year, John Palfrey, Professor of Law, Vice Dean, Library and Information Resources, Faculty Co-Director, Berkman Center For Internet and Society at Harvard Law School, was intrigued by an idea that I’ve been kicking around for several years and invited me to come to Harvard to work on it.

With the support of an incredibly talented staff in my home library, I felt comfortable taking a semester off. And so, I am visiting, as an Academic Fellow at Harvard Law School Library’s Innovation Lab for the 2011 fall semester.]

Designing a Solution to a Problem
The project that I’ve been asked to explore has to do with the inherent challenges of conducting case law research using full text online databases. The project, whose working title is “Leading Case Service,” is designed to make online case law research more productive and more efficient. There are three factors that make online case law research very difficult.

First is the size of the database. It is estimated that there are approximately ten million published cases in the American legal system. The size of the database alone poses very serious difficulties for designers of search engines and indexing systems, both digital and analog.

The size of the corpus of case law in the American legal system isn’t merely the result of our society’s litigious nature. Prior to the mid-nineteenth century, the publication of cases was done very judiciously. Most cases were published in selective case reporters that only published leading cases. In fact, the most influential American case reporter in the nineteenth century was the predecessor to what we know today as American Law Reports, or ALR, and it only published cases of some particular significance, either because the opinion made a ruling on a novel aspect of the law or clarified an issue that had been dealt with by many courts with varying outcomes. In the late nineteenth century, the West Publishing Company entered the case law publishing market and effectively turned cases into a commodity. The method by which West published cases was virtually indiscriminate because it published any and all cases submitted to it by the courts. Its business model was founded on the premise that the more cases it could publish, the better; the more cases it could publish, the more volumes it could sell.

As the volume of cases it published grew, West developed an elaborate subject indexing system to help researchers. We know the indexing system as the Key Number System, and the index as the West Digest System. Today, the index alone numbers several thousand volumes! Coupled with Shepard’s citations, the digest and case-verification systems helped researchers identify both cases that were useful and those that were “still good law,” in the sense that they hadn’t been specifically overruled by another court. This system was extremely accurate, thorough, and objective but still left the researcher with a serious problem of having to wade through a substantial mass of material. The comprehensiveness of the West National Reporter System, its Digests and Shepard’s meant that the cases discovered on any one particular topic could number in the thousands.

The enormous volume of case law poses difficulties for researchers for another reason. Important research in the field of information science explains that, due to the vagaries of language and other empirical laws of linguistics, full text database searching is by definition inefficient, even in databases filled with documents of a professional nature and highly specialized vocabulary, such as law. Studies have shown that the best a full text search engine can do is retrieve only about twenty percent of the relevant documents on a topic. With 10 million published cases, even 20% efficiency yields far more cases than any person can reasonably be expected to read.

Second, full text databases are objective search tools. This makes full text case law databases very difficult places for researchers to go to find answers about the law. For instance, let’s say you want to know what the law is on the rights of grandparents to intervene in custody proceedings in dissolution cases. A search for cases on this topic, if done with absolute precision, may yield dozens, if not hundreds, of cases, which is not what you really want or need. In this instance, a more useful approach would be to consult secondary sources, such as handbooks or treatises, that not only discuss the leading cases in the field but also summarize and analyze what these cases mean to the practitioner. Full text case law databases themselves are only part of what researchers need to complete their research.

Third, in order for online databases to be efficiently used, each document, as well as its sections, parts, words, letters, etc., should be indexed and tagged with what’s known in the computing world as meta-data. Indexing on this scale is massive and extremely complex but can make the development of search engines designed to work with these huge databases much more efficient. This is why Westlaw’s and Lexis’s search engines are so useful. Each company runs the full text of each case contained in its databases through extensive indexing and tagging. Part of this process eliminates repetitive words that have no legal meaning, such as articles, conjunctions, etc. Further, the content of the cases is divided into sections, such as majority and minority opinions, jurisdictions, etc., which dramatically helps narrow the search results. Indexing and tagging on this scale is a very costly venture, leaving only Lexis and Westlaw dominating the field. The process is also so complex that each company’s processes are highly guarded trade secrets. The Lexis and Westlaw case law databases are composed entirely of public domain materials, but they are still extremely expensive to use. Each company cites the high cost of thorough indexing, tagging and sorting as a rationale to charge high prices for access.
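
The general idea of dropping common words and tagging text by section can be shown with a toy inverted index in Python. This is a textbook-style sketch only: the actual Westlaw and Lexis processes are trade secrets, as noted above, and nothing here reflects them. The stop word list, the section labels and the function name index_sections are assumptions made for illustration.

```python
import re
from collections import defaultdict

# A tiny illustrative stop word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in"}


def index_sections(case_id, sections, index=None):
    """Add each non-stop word to an index of word -> {(case_id, section), ...}."""
    index = defaultdict(set) if index is None else index
    for section, text in sections.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add((case_id, section))
    return index


if __name__ == "__main__":
    index = index_sections("case-1", {
        "majority": "The custody of the child",
        "dissent": "Grandparents and custody",
    })
    # Show where the word "custody" appears, by case and section.
    print(sorted(index["custody"]))
```

Because each entry records the section as well as the case, a search could be limited to majority opinions only, which is the kind of narrowing the paragraph above describes.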

A Solution to a Problem
The “Leading Case Service” may be a means of leveling the playing field for newcomers to the online database market, or for existing services that offer access to case law for free. The theory behind the project is that among the ten million cases in the American legal system, there is a relatively small percentage of cases that are considered more significant and interesting than the rest. If these cases can be identified and a search tool developed to exploit them, it may make searching case law more efficient by helping researchers focus on the most important cases first, before moving into the vast body of case law to find newer cases, cases with variant facts or those more specific to a particular jurisdiction.

The first step is to determine if there is such a thing as a group of “leading cases” and, if there is, to figure out how to find them and use them. The theory at present is that this group of cases can be found in the body of secondary materials. Initially, I thought that we could find leading cases in the footnotes and body of treatises, presuming that treatise writers would discuss or cite to only the most important cases in their fields. To gather this group of leading cases, we could “mine” treatises and discover the cases cited in them. However, there are significant challenges to mining treatises for the cases the authors have cited, not the least of which is that treatises are published in many different formats and by enough publishers to make it difficult to use a single system to acquire the desired information. I’ve been convinced to put this step on hold, at least for the moment.

Our thinking at present is that we may have better luck focusing our efforts on cases cited in law review articles. There are two reasons that we think that law review articles may be better sources with which to discover this body of leading cases. First, we presume that the writers of law review articles, as experts in their fields, are vigilant in identifying important cases in those fields and that, overall, these scholars will discuss all the important cases in American law. I realize that this is a strong presumption, but over the last century, virtually all significant developments in law have been discussed and debated at length in law reviews and law journals. It follows, then, that the cases cited by the writers should be the ones most important or significant for one reason or another and can be identified as “leading cases,” those that researchers should read or at least be aware of when researching case law in that field.

A second advantage of using law review articles to identify leading cases is that the body of scholarship is continually expanding. If my theory is correct and we can discover this body of case law, we may be able to create an automated process that will continually add to the corpus of leading cases.

Questions
There are many questions to be answered. The most interesting question, and the one that I’ll be spending my time exploring initially, is this: exactly how many cases are cited in law review articles?

We know that there are around ten million cases published in total, but we don’t know what percentage of those cases found their way into the footnotes and text of law review articles. We are very close to obtaining the tools to answer this question conclusively. Hunches about the percentage of cases discussed in law reviews range from 5% down to 1% or less, which works out to 500,000 cases at the high end and 100,000 or fewer at the low end. If this is true, then full text case law database searching should be greatly improved by the mere fact that the researcher would be searching in a database of two or three hundred thousand cases instead of ten million!
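
The arithmetic behind these estimates is simple enough to check with a few lines of Python. The only input taken from the post is the approximate total of ten million cases; the function names cited_cases and shrink_factor are hypothetical.

```python
TOTAL_CASES = 10_000_000  # approximate figure used in this post


def cited_cases(percent, total=TOTAL_CASES):
    """Number of cases that a whole-number percentage of the total represents."""
    return total * percent // 100


def shrink_factor(subset, total=TOTAL_CASES):
    """How many times smaller the subset is than the full body of case law."""
    return total / subset


if __name__ == "__main__":
    for percent in (1, 3, 5):
        n = cited_cases(percent)
        print(f"{percent}% of {TOTAL_CASES:,} is {n:,} cases, "
              f"about {shrink_factor(n):.0f} times smaller than the full set")
```

At 5% the leading case database would hold 500,000 cases, a twentyfold reduction; at 1% it would hold 100,000, a hundredfold reduction.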

Assuming that our initial tests reveal that there is, indeed, a body of leading cases that we can identify, many interesting possibilities emerge. The cases themselves may be ranked based upon the numbers of law reviews or journals that have cited them. (This is sort of a twist on Shepard’s service for law review articles. Instead of Shepardizing articles to find cases that cite to the articles, we’re looking for cases cited in the articles themselves.) Other information that may help rank the value of the cases includes the standing of the journal itself in which the article is published, or the reputation, publishing record or school of residence of the author.
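
A ranking of this kind could be sketched as follows in Python. The sketch assumes that citations have already been extracted as (case, article, journal) records, counts each article only once per case, and lets an optional table of journal weights stand in for the "standing of the journal." The function name rank_cases and the weight values are hypothetical, and how journal standing should actually be measured is left open.

```python
from collections import defaultdict


def rank_cases(citations, journal_weight=None):
    """Rank cases by weighted number of distinct citing articles.

    citations is a list of (case, article_id, journal) tuples. Returns a list
    of (case, article_count, score) tuples, highest score first.
    """
    journal_weight = journal_weight or {}
    citing = defaultdict(set)
    for case, article_id, journal in citations:
        # A set ignores repeat citations of a case within the same article.
        citing[case].add((article_id, journal))

    ranked = []
    for case, articles in citing.items():
        # Journals without an assigned weight count as 1.0.
        score = sum(journal_weight.get(journal, 1.0) for _, journal in articles)
        ranked.append((case, len(articles), score))
    # Sort by score, then article count, then name so ties are stable.
    return sorted(ranked, key=lambda row: (-row[2], -row[1], row[0]))


if __name__ == "__main__":
    citations = [
        ("Smith v. Jones", "article-1", "Journal A"),
        ("Smith v. Jones", "article-1", "Journal A"),
        ("Smith v. Jones", "article-2", "Journal B"),
        ("Doe v. Roe", "article-3", "Journal A"),
    ]
    for row in rank_cases(citations, journal_weight={"Journal A": 3.0}):
        print(row)
```

Counting distinct articles rather than raw mentions keeps a single article with many footnotes from dominating the ranking; the per-article frequency discussed in the earlier post could be added as a further factor.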

Even if we are successful in identifying this corpus of leading cases, we have yet to determine how they should be used. The options are to create a separate database or to use meta-data to tag, or identify, the cases so that search engines will be able to identify the leading cases from among the rest of the millions of cases in the corpus of American law. Depending on the tags used, the researcher can use this information to sort search results in interesting and valuable ways. For example, a researcher who wants to know what the law is in an unfamiliar area could begin with a full text case law database and immediately identify the most important cases in the field. After perusing these cases, the researcher could then use links and metadata to immediately find articles, blogs and other pertinent online materials.

The goal of the project is to create a new way of using online digital legal materials. New technologies have allowed us to think of combining information in ways that were unheard of, even unthinkable, before today. “Leading Case Service” is essentially a ‘mash-up’ of online case law databases and online databases of law review articles. To this mash-up, colleagues have suggested that we may be able to add blogs, digital commons, wire-services, websites, legal periodical indexes and possibly treatises. The use of this information is not merely academic. It may also prove to be a way to power new search engines or discover new ways that various parts of the conceptual, scholarly world of the law influence each other.