Friday, October 14, 2011

A New Mode of Full-text Case Retrieval - a work in progress

[This past academic year, John Palfrey, Professor of Law, Vice Dean, Library and Information Resources, Faculty Co-Director, Berkman Center For Internet and Society at Harvard Law School, was intrigued by an idea that I’ve been kicking around for several years and invited me to come to Harvard to work on it.

With the support of an incredibly talented staff in my home library, I felt comfortable taking a semester off. And so, I am visiting, as an Academic Fellow at Harvard Law School Library’s Innovation Lab for the 2011 fall semester.]

Designing a Solution to a Problem
The project that I’ve been asked to explore has to do with the inherent challenges of conducting case law research using full text online databases. The project, whose working title is “Leading Case Service,” is designed to make online case law research more productive and more efficient. There are three factors that make online case law research very difficult.

First is the size of the database. It is estimated that there are approximately ten million published cases in the American legal system. The size of the database alone poses very serious difficulties for designers of search engines and indexing systems, both digital and analog.

The size of the corpus of case law in the American legal system isn’t merely the result of our society’s litigious nature. Prior to the mid-nineteenth century, the publication of cases was done very judiciously. Most cases were published in selective case reporters that only published leading cases. In fact, the most influential American case reporter in the nineteenth century was the predecessor to what we know today as American Law Reports, or ALR, and it only published cases of some particular significance, either because the opinion made a ruling on a novel aspect of the law or clarified an issue that had been dealt with by many courts with varying outcomes. In the late nineteenth century, the West Publishing Company entered the case law publishing market and effectively turned cases into a commodity. The method by which West published cases was virtually indiscriminate because it published any and all cases submitted to it by the courts. Its business model was founded on the premise that the more cases it could publish, the better; the more cases it could publish, the more volumes it could sell.

As the volume of cases it published grew, West developed an elaborate subject indexing system to help researchers. We know the indexing system as the Key Number System, and the index as the West Digest System. Today, the index alone numbers several thousand volumes! Coupled with Shepard’s citations, the digest and case-verification systems helped researchers identify both cases that were useful and those that were “still good law,” in the sense that they hadn’t been specifically overruled by another court. This system was extremely accurate, thorough, and objective but still left the researcher with a serious problem of having to wade through a substantial mass of material. The comprehensiveness of the West National Reporter System, its Digests and Shepard’s meant that the cases discovered on any one particular topic could number in the thousands.

The enormous volume of case law poses difficulties for researchers for another reason. Important research in the field of information science explains that, due to the vagaries of language and other empirical laws of linguistics, full text database searching is by definition inefficient, even in databases filled with documents of a professional nature and highly specialized vocabulary, such as law. Studies have shown that the best a full text search engine can do is retrieve only about twenty percent of the relevant documents on a topic. With 10 million published cases, even a 20% efficiency yields far more cases than any person can reasonably be expected to read.

Second, full text databases are objective search tools. This makes full text case law databases very difficult places for researchers to go to find answers about the law. For instance, let’s say you want to know what the law is on the rights of grandparents to intervene in custody proceedings in dissolution cases. A search for cases on this topic, even if done with absolute precision, may yield dozens, if not hundreds, of cases, which is not what you really want or need. In this instance, a more useful approach would be to consult secondary sources, such as handbooks or treatises, that not only discuss the leading cases in the field but also summarize and analyze what these cases mean to the practitioner. Full text case law databases themselves are only part of what researchers need to complete their research.

Third, in order for online databases to be efficiently used, each document, as well as its sections, parts, words, letters, etc., should be indexed and tagged with what’s known in the computing world as meta-data. Indexing on this scale is massive and extremely complex but can make the development of search engines designed to work with these huge databases much more efficient. This is why Westlaw’s and Lexis’s search engines are so useful. Each company runs the full text of each case contained in its databases through extensive indexing and tagging. Part of this process eliminates repetitive words that have no legal meaning, such as articles, conjunctions, etc. Further, the content of the cases is divided into sections, such as majority and minority opinions, jurisdictions, etc., which dramatically helps narrow the search results. Indexing and tagging on this scale is a very costly venture, leaving only Lexis and Westlaw dominating the field. The process also is so complex that each company’s processes are highly guarded trade secrets. The Lexis and Westlaw case law databases are composed entirely of public domain materials, but they still are extremely expensive to use. Each company cites the high cost of thorough indexing, tagging and sorting as a rationale to charge high prices for access.
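To make the stop-word and indexing step more concrete, here is a minimal sketch in Python of an inverted index built after dropping articles and conjunctions. The stop-word list, the sample snippets and the function names are illustrative assumptions only; the actual Westlaw and Lexis processes are trade secrets and are certainly far more elaborate.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list: a few articles, conjunctions and prepositions.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_index(documents):
    """Map each remaining word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

# Made-up snippets standing in for full-text opinions.
sample = {
    "case-1": "The grandparents sought to intervene in the custody proceeding.",
    "case-2": "Custody of the child was awarded to the mother.",
}
index = build_index(sample)
print(sorted(index["custody"]))  # both sample cases mention custody
```

A search engine can then answer a one-word query by looking up a single key instead of scanning every opinion, which is part of why indexing pays off at the scale of millions of cases.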

A Solution to a Problem
The “Leading Case Service” may be a means of leveling the playing field for newcomers to the online database market, or for existing services that offer access to case law for free. The theory behind the project is that among the ten million cases in the American legal system, there is a relatively small percentage of cases that are considered more significant and interesting than the rest. If these cases can be identified and a search tool developed to exploit them, it may make searching case law more efficient by helping researchers focus on the most important cases first, before moving into the vast body of case law to find newer cases, cases with variant facts or those more specific to a particular jurisdiction.

The first step is to determine if there is such a thing as a group of “leading cases” and, if there is, to figure out how to find them and use them. The theory at present is that this group of cases can be found in the body of secondary materials. Initially, I thought that we could find leading cases in the footnotes and body of treatises, presuming that treatise writers would discuss or cite to only the most important cases in their fields. To gather this group of leading cases, we could “mine” treatises and discover the cases cited in them. However, there are significant challenges to mining treatises for the cases the authors have cited, not the least of which is that treatises are published in many different formats and by enough publishers to make it difficult to use a single system to acquire the desired information. I’ve been convinced to put this step on hold, at least for the moment.

Our thinking at present is that we may have better luck focusing our efforts on cases cited in law review articles. There are two reasons that we think that law review articles may be better sources with which to discover this body of leading cases. First, we presume that the writers of law review articles, as experts in their fields, are vigilant in identifying important cases in those fields and that, overall, these scholars will discuss all the important cases in American law. I realize that this is a strong presumption, but over the last century, virtually all significant developments in law have been discussed and debated at length in law reviews and law journals. It follows, then, that the cases cited by the writers should be the ones most important or significant for one reason or another and can be identified as “leading cases,” those that researchers should read or at least be aware of when researching case law in that field.

A second advantage of using law review articles to identify leading cases is that the body of scholarship is continually expanding. If my theory is correct and we can discover this body of case law, we may be able to create an automated process that will continually add to the corpus of leading cases.
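As a rough illustration of what such an automated process might start from, the following Python sketch pulls reporter citations out of the text of an article. The regular expression is an assumption that covers only a handful of reporter abbreviations; real citation formats vary far more, and the function name is hypothetical.

```python
import re

# Assumed pattern: volume, reporter abbreviation, first page, e.g. "347 U.S. 483".
# Only a few federal reporters are listed here.
CITATION = re.compile(
    r"\b(\d{1,3})\s+(U\.S\.|S\. ?Ct\.|F\.2d|F\.3d|F\. ?Supp\.)\s+(\d{1,4})\b"
)

def extract_citations(text):
    """Return the distinct reporter citations found in the text, in order."""
    seen = []
    for volume, reporter, page in CITATION.findall(text):
        cite = f"{volume} {reporter} {page}"
        if cite not in seen:
            seen.append(cite)
    return seen

footnote = ("See Brown v. Board of Education, 347 U.S. 483 (1954); "
            "cf. Erie R.R. v. Tompkins, 304 U.S. 64 (1938); Brown, 347 U.S. 483.")
print(extract_citations(footnote))  # ['347 U.S. 483', '304 U.S. 64']
```

Running a routine like this over each newly published article would let the list of cited cases grow as the scholarship grows.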

Questions
There are many questions to be answered. The most interesting question, and the one that I’ll be spending my time exploring initially, is this: exactly how many cases are cited by law review articles?

We know that there are around ten million cases published in total, but we don’t know what percentage of those cases found their way into the footnotes and text of law review articles. We are very close to obtaining the tools to answer this question conclusively. Hunches about the percentage of cases discussed in law reviews range from 5% down to 1% or less, that is, from 500,000 cases down to 100,000 or fewer. If this is true, then full text case law database searching should be greatly improved by the mere fact that the researcher would be searching in a database of two or three hundred thousand cases instead of ten million!
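The arithmetic behind these hunches can be checked with a few lines of Python. The ten million figure is the estimate used in this post; the function name is just for illustration.

```python
TOTAL_CASES = 10_000_000  # the post's estimate of published American cases

def subset_size(percent):
    """Number of cases in a subset that is `percent` percent of the corpus."""
    return TOTAL_CASES * percent // 100

for percent in (1, 5):
    size = subset_size(percent)
    print(f"{percent}% of the corpus = {size:,} cases; "
          f"the search space shrinks by a factor of {TOTAL_CASES // size}")
```

At 1% the researcher searches 100,000 cases, a hundredfold reduction; at 5% the figure is 500,000, a twentyfold reduction.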

Assuming that our initial tests reveal that there is, indeed, a body of leading cases that we can identify, many interesting possibilities emerge. The cases themselves may be ranked based upon the numbers of law reviews or journals that have cited them. (This is sort of a twist on Shepard’s service for law review articles. Instead of Shepardizing articles to find cases that cite to the articles, we’re looking for cases cited in the articles themselves.) Other information that may help rank the value of the cases includes the standing of the journal itself in which the article is published, or the reputation, publishing record or school of residence of the author.
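A minimal sketch of the simplest ranking, counting how many distinct articles cite each case, might look like this in Python. The article names are made up and the function is hypothetical; weighting by journal standing or author reputation would be a further step that is not shown.

```python
from collections import defaultdict

def rank_cases(article_citations):
    """Rank cases by how many distinct articles cite them, most cited first."""
    citing = defaultdict(set)
    for article, cases in article_citations.items():
        for case in cases:
            citing[case].add(article)
    # Sort by descending count, then by citation string to break ties.
    return sorted(((case, len(arts)) for case, arts in citing.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Hypothetical articles and the cases each one cites.
articles = {
    "article-A": ["347 U.S. 483", "304 U.S. 64"],
    "article-B": ["347 U.S. 483"],
    "article-C": ["347 U.S. 483", "304 U.S. 64", "5 U.S. 137"],
}
for case, count in rank_cases(articles):
    print(case, count)
```

Using a set per case means that an article citing the same case in twenty footnotes still counts only once.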

Even if we are successful in identifying this corpus of leading cases, we have yet to determine how they should be used. The options are to create a separate database or to use meta-data to tag, or identify, the cases so that search engines will be able to identify the leading cases from among the rest of the millions of cases in the corpus of American law. Depending on the tags used, the researcher can use this information to sort search results in interesting and valuable ways. For example, a researcher desiring to know what the law is in an area novel to him could begin with a full text case law database and immediately identify the most important cases in the field. After perusing these cases, he could then use links and metadata to immediately find articles, blogs and other pertinent online materials.
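The tagging option could be sketched as follows in Python. The field names (leading, citing_articles), the sample cases and the search function are all hypothetical, chosen only to show how a tag lets a search engine float leading cases to the top of an ordinary full text result list.

```python
from dataclasses import dataclass

@dataclass
class Case:
    citation: str
    text: str
    leading: bool = False      # hypothetical metadata tag
    citing_articles: int = 0   # hypothetical count of citing law review articles

def search(cases, term):
    """Return cases whose text contains the term, leading cases first."""
    hits = [c for c in cases if term.lower() in c.text.lower()]
    # False sorts before True, so tagged cases come first, most cited on top.
    return sorted(hits, key=lambda c: (not c.leading, -c.citing_articles))

corpus = [
    Case("case-1", "Grandparent visitation denied."),
    Case("case-2", "Grandparent visitation statute struck down.", True, 40),
    Case("case-3", "Contract dispute over delivery terms."),
    Case("case-4", "Grandparent standing in custody matters.", True, 12),
]
print([c.citation for c in search(corpus, "grandparent")])
```

Because the tags live alongside the cases rather than in a separate database, the untagged cases remain searchable and simply appear after the leading ones.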

The goal of the project is to create a new way of using online digital legal materials. New technologies have allowed us to think of combining information in ways that were unheard of, even unthinkable, before today. “Leading Case Service” is essentially a ‘mash-up’ of online case law databases and online databases of law review articles. To this mash-up, colleagues have suggested that we may be able to add blogs, digital commons, wire-services, websites, legal periodical indexes and possibly treatises. The use of this information is not merely academic. It may also prove to be a way to power new search engines or discover new ways that various parts of the conceptual, scholarly world of the law influence each other.

11 comments:

KenHirsh said...

Time will tell whether this is a valid method for helping improve legal research and understanding of the law, but it certainly is an innovative idea and I look forward to the results.

Anonymous said...

Precydent was a somewhat similar effort back in 2006-2009. The project folded, but the law prof who did the research is surely still around. Here's a link.

http://lawprofessors.typepad.com/law_librarian_blog/2008/01/law-prof-as-too.html

John Mayer
jmayer@cali.org

Anonymous said...

Also check out ..

http://taxprof.typepad.com/taxprof_blog/2006/05/the_most_heavil.html

..and..

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=785826

..where Adam Steinman finds the 1000 most cited cases as part of the paper cited above.

Frank Bennett said...

Interesting! This would tie in nicely with Zotero/CSL legal support for authoring @ http://citationstylist.org/

Robert Richards said...

Suggest looking at Katz & Bommarito's work on judicial citation networks: http://computationallegalstudies.com/tag/judicial-citation-network/ (scroll down)

Anonymous said...

Rich:

This is the sort of thing that would work well as a metadata model expressed in RDF, one that comprehends both the cases and the law review articles, and makes use of categorization schemes that might arguably be applied to each.

Karl Gruben said...

Rich,
Interesting concepts - a leap back to the future with the concept of leading cases (ALR) vs all cases (NRS). Hmmm. Isn't that what Westlaw Next is doing with its social context component - if more people who know what they are doing cite this case then it must be important and the algorithm will bubble it up higher in the returns?
Karl Gruben

Frank Bennett said...

Echoing what Tom Bruce says above. Zotero can digest RDF, and it would be great to have. I'm not much of an RDF jockey (and need to read up), but it might make sense to separate the details needed for building cites and the tags etc needed for categorization into separate ontologies. There are models about that can be used for the former.

Dick Danner said...

Rich: As always your work is thought-provoking and suggestive. A couple comments:

For one, it is surprising to me, given the size of the law review literature, that the number of cases discussed would be as small as you think it might be. This in itself would be a rather significant finding. (I am not clear from your posting, though, about the make-up of your database of “law reviews” or whether it includes student pieces as well as those written by faculty or other scholars.)

Also, you might want to think about what value is provided by ranking the cases. Ranking is such a tricky business in general and it isn’t clear to me what rankings by any of the factors you mention tell us that would be more than “interesting”.

Good luck! I look forward to your next reports.

Unknown said...

Dick, We should have a better picture of the numbers of cases in a few weeks. I'm curious about that, too. The question about student-written pieces also raises an interesting question. It was common practice in earlier journals to have a section called "Recent Cases," or something like that. I have to look more closely at those to see if they're worth including. But modern case notes are sometimes quite thoughtful and may be worth including. We'll have to see what we find in terms of numbers.

Here's my thinking about the ranking: If a legal issue is discussed or debated between scholars in a series of articles and the associated cases get discussed in three or four articles, then they would be more valuable than other cases on the same topic that aren't discussed in the articles in the series. That's about all this ranking would tell a researcher. But that may be good information for the researcher to have.

Ed Walters said...

I'm really excited to hear more about your work this year, Rich! A few thoughts:

1. Why not include citations from other judicial opinions in the analysis? Citations in law reviews might show an academic interest in cases with constitutional issues -- where citations by other judges in opinions might correlate better with precedential authority.

We do something like this at Fastcase, integrating citation analysis in the results, to bring more authoritative cases to the top of results.

2. You might want to take a look at some of the cloud printing tools we've built to use in your research. They extract all the citations from Web pages, Word documents, etc., so you could quickly extract cited cases. (Most people use them for one-click batch printing of all cited cases, but I suppose you could use cloud printing for citation extraction as well.)

Please let me know if we can support your work this coming year - very interesting stuff!