Showing posts with label Lexis. Show all posts

Friday, October 14, 2011

A New Mode of Full-text Case Retrieval - a work in progress

[This past academic year, John Palfrey, Professor of Law, Vice Dean, Library and Information Resources, Faculty Co-Director, Berkman Center For Internet and Society at Harvard Law School, was intrigued by an idea that I’ve been kicking around for several years and invited me to come to Harvard to work on it.

With the support of an incredibly talented staff in my home library, I felt comfortable taking a semester off. And so, I am visiting, as an Academic Fellow at Harvard Law School Library’s Innovation Lab for the 2011 fall semester.]

Designing a Solution to a Problem
The project that I’ve been asked to explore has to do with the inherent challenges of conducting case law research using full text online databases. The project, whose working title is “Leading Case Service,” is designed to make online case law research more productive and more efficient. There are three factors that make online case law research very difficult.

First is the size of the database. It is estimated that there are approximately ten million published cases in the American legal system. The size of the database alone poses very serious difficulties for designers of search engines and indexing systems, both digital and analog.

The size of the corpus of case law in the American legal system isn’t merely the result of our society’s litigious nature. Prior to the mid-nineteenth century, the publication of cases was done very judiciously. Most cases were published in selective case reporters that only published leading cases. In fact, the most influential American case reporter in the nineteenth century was the predecessor to what we know today as American Law Reports, or ALR, and it only published cases of some particular significance, either because the opinion made a ruling on a novel aspect of the law or clarified an issue that had been dealt with by many courts with varying outcomes. In the late nineteenth century, the West Publishing Company entered the case law publishing market and effectively turned cases into a commodity. The method by which West published cases was virtually indiscriminate because it published any and all cases submitted to it by the courts. Its business model was founded on the premise that the more cases it could publish, the better; the more cases it could publish, the more volumes it could sell.

As the volume of cases it published grew, West developed an elaborate subject indexing system to help researchers. We know the indexing system as the Key Number System, and the index as the West Digest System. Today, the index alone numbers several thousand volumes! Coupled with Shepard’s citations, the digest and case-verification systems helped researchers identify both cases that were useful and those that were “still good law,” in the sense that they hadn’t been specifically overruled by another court. This system was extremely accurate, thorough, and objective but still left the researcher with a serious problem of having to wade through a substantial mass of material. The comprehensiveness of the West National Reporter System, its Digests and Shepard’s meant that the cases discovered on any one particular topic could number in the thousands.

The enormous volume of case law poses difficulties for researchers for another reason. Important research in the field of information science explains that, due to the vagaries of language and other empirical laws of linguistics, full text database searching is by definition inefficient, even in databases filled with documents of a professional nature and highly specialized vocabulary, such as law. Studies have shown that, at best, a full text search engine retrieves only about twenty percent of the relevant documents on a topic. With 10 million published cases, even a 20% efficiency yields far more cases than any person can reasonably be expected to read.

Second, full text databases are objective search tools. This makes full text case law databases very difficult places for researchers to go to find answers about the law. For instance, let’s say you want to know what the law is on the rights of grandparents to intervene in custody proceedings in dissolution cases. A search for cases on this topic, if done with absolute precision, may yield dozens, if not hundreds, of cases, not what you really want or need. In this instance, a more useful approach would be to consult secondary sources, such as handbooks or treatises, that not only discuss the leading cases in the field but also summarize and analyze what these cases mean to the practitioner. Full text case law databases themselves are only part of what researchers need to complete their research.

Third, in order for online databases to be efficiently used, each document, as well as its sections, parts, words, letters, etc., should be indexed and tagged with what’s known in the computing world as meta-data. Indexing on this scale is massive and extremely complex but can make the development of search engines designed to work with these huge databases much more efficient. This is why Westlaw’s and Lexis’s search engines are so useful. Each company runs the full text of each case contained in its databases through extensive indexing and tagging. Part of this process eliminates repetitive words that have no legal meaning, such as articles, conjunctions, etc. Further, the content of the cases is divided into sections, such as majority and minority opinions, jurisdictions, etc., which dramatically helps narrow the search results. Indexing and tagging on this scale is a very costly venture, leaving only Lexis and Westlaw dominating the field. The process is also so complex that each company’s processes are highly guarded trade secrets. The Lexis and Westlaw case law databases are composed entirely of public domain materials, but they are still extremely expensive to use. Each company cites the high cost of thorough indexing, tagging and sorting as a rationale to charge high prices for access.
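To make the idea of indexing and tagging concrete, here is a minimal sketch in Python. It assumes a tiny, made-up stop-word list and two invented cases already split into sections; the names STOP_WORDS and build_index are hypothetical, and nothing here reflects how Westlaw or Lexis actually process their databases, which the post notes is a trade secret.

```python
from collections import defaultdict

# Hypothetical stop-word list; a real service would use a much larger one.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in", "is"}

def build_index(cases):
    """Map each (section, word) pair to the set of case ids containing it."""
    index = defaultdict(set)
    for case_id, sections in cases.items():
        for section, text in sections.items():
            for raw in text.lower().split():
                word = raw.strip(".,;:()\"'")
                # Skip words that carry no legal meaning.
                if word and word not in STOP_WORDS:
                    index[(section, word)].add(case_id)
    return index

# Invented example cases, each divided into tagged sections.
cases = {
    "case-1": {
        "majority": "The grandparents may intervene in custody.",
        "dissent": "Intervention of the grandparents is barred.",
    },
    "case-2": {"majority": "Custody belongs to the parents."},
}

index = build_index(cases)
# Case ids whose majority opinion mentions "custody".
print(sorted(index[("majority", "custody")]))
```

Because every word is stored together with its section tag, a search can be limited to, say, majority opinions only, which is the kind of narrowing described above.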

A Solution to a Problem
The “Leading Case Service” may be a means of leveling the playing field for newcomers to the online database market, or for existing services that offer access to case law for free. The theory behind the project is that among the ten million cases in the American legal system, there is a relatively small percentage of cases that are considered more significant and interesting than the rest. If these cases can be identified and a search tool developed to exploit them, it may make searching case law more efficient by helping researchers focus on the most important cases first, before moving into the vast body of case law to find newer cases, cases with variant facts or those more specific to a specific jurisdiction.

The first step is to determine if there is such a thing as a group of “leading cases” and, if there is, to figure out how to find them and use them. The theory at present is that this group of cases can be found in the body of secondary materials. Initially, I thought that we could find leading cases in the footnotes and body of treatises, presuming that treatise writers would discuss or cite to only the most important cases in their fields. To gather this group of leading cases, we could “mine” treatises and discover the cases cited in them. However, there are significant challenges to mining treatises for the cases the authors have cited, not the least of which is that treatises are published in many different formats and by enough publishers to make it difficult to use a single system to acquire the desired information. I’ve been convinced to put this step on hold, at least for the moment.

Our thinking at present is that we may have better luck focusing our efforts on cases cited in law review articles. There are two reasons that we think that law review articles may be better sources with which to discover this body of leading cases. First, we presume that the writers of law review articles, as experts in their fields, are vigilant in identifying important cases in those fields and that, overall, these scholars will discuss all the important cases in American law. I realize that this is a strong presumption, but over the last century, virtually all significant developments in law have been discussed and debated at length in law reviews and law journals. It follows, then, that the cases cited by the writers should be the ones most important or significant for one reason or another and can be identified as “leading cases,” those that researchers should read or at least be aware of when researching case law in that field.

A second advantage of using law review articles to identify leading cases is that the body of scholarship is continually expanding. If my theory is correct and we can discover this body of case law, we may be able to create an automated process that will continually add to the corpus of leading cases.

Questions
There are many questions to be answered. The most interesting question, and the one that I’ll be spending my time exploring initially is, exactly how many cases are cited by law review articles?

We know that there are around ten million cases published in total, but we don’t know what percentage of those cases found their way into the footnotes and text of law review articles. We are very close to obtaining the tools to answer this question conclusively. Hunches about the percentage of cases discussed in law reviews range from 5% to less than 1%, or roughly 100,000 to 500,000 cases. If this is true, then full text case law database searching should be greatly improved by the mere fact that the researcher would be searching in a database of two or three hundred thousand cases instead of ten million!
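The arithmetic behind these hunches is easy to check. The short Python sketch below assumes only the ten million figure quoted above; the function name cited_cases is made up for illustration.

```python
TOTAL_CASES = 10_000_000  # approximate number of published cases, per the post

def cited_cases(percent):
    """Number of cases that a given whole-number percentage of the corpus represents."""
    return TOTAL_CASES * percent // 100

for percent in (1, 5):
    subset = cited_cases(percent)
    shrink = TOTAL_CASES // subset  # how many times smaller the searchable corpus becomes
    print(f"{percent}% -> {subset:,} cases, a corpus {shrink}x smaller")
```

At the low end the searchable corpus is a hundredth of its original size, and even at the high end it is a twentieth.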

Assuming that our initial tests reveal that there is, indeed, a body of leading cases that we can identify, many interesting possibilities emerge. The cases themselves may be ranked based upon the numbers of law reviews or journals that have cited them. (This is sort of a twist on Shepard’s service for law review articles. Instead of Shepardizing articles to find cases that cite to the articles, we’re looking for cases cited in the articles themselves.) Other information that may help rank the value of the cases includes the standing of the journal itself in which the article is published, or the reputation, publishing record or school of residence of the author.
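A ranking of this kind could be sketched as follows. This is only an illustration under stated assumptions: the citation records, journal names and weights are all invented, and rank_cases is a hypothetical helper, not part of any existing tool. Each case is scored by the distinct journals that cite it, with an optional weight for the journal's standing.

```python
from collections import defaultdict

# Invented records: (article id, journal, case cited in the article).
citations = [
    ("art-1", "Journal A", "Case X"),
    ("art-2", "Journal A", "Case X"),
    ("art-3", "Journal B", "Case X"),
    ("art-3", "Journal B", "Case Y"),
    ("art-4", "Journal C", "Case Z"),
]

# Hypothetical journal-standing weights; unlisted journals count as 1.0.
journal_weight = {"Journal A": 2.0, "Journal B": 1.0}

def rank_cases(citations, journal_weight):
    """Score each case by the weighted number of distinct journals citing it."""
    journals_by_case = defaultdict(set)
    for _article, journal, case in citations:
        journals_by_case[case].add(journal)
    scores = {
        case: sum(journal_weight.get(journal, 1.0) for journal in journals)
        for case, journals in journals_by_case.items()
    }
    # Highest score first; ties broken alphabetically so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranking = rank_cases(citations, journal_weight)
print(ranking)
```

Author reputation or publishing record could be folded in the same way, as one more weight multiplied into the score.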

Even if we are successful in identifying this corpus of leading cases, we have yet to determine how they should be used. The options are to create a separate database or to use meta-data to tag, or identify, the cases so that search engines will be able to identify the leading cases from among the rest of the millions of cases in the corpus of American law. Depending on the tags used, the researcher can use this information to sort search results in interesting and valuable ways. For example, a researcher desiring to know what the law is in an area novel to him could begin with a full text case law database and immediately identify the most important cases in the field. After perusing these cases, links and metadata could then be used to immediately find articles, blogs and other pertinent online materials.
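The tagging option might look something like the sketch below. It assumes an ordinary full text search has already returned a list of case ids, and that leading cases carry a made-up metadata tag named article_citations; both the tag and the sort_results helper are hypothetical.

```python
def sort_results(results, leading_tags):
    """Move tagged leading cases to the front, most-cited first.

    Untagged cases follow in their original search order.
    """
    def key(item):
        position, case_id = item
        tag = leading_tags.get(case_id)
        if tag is None:
            return (1, 0, position)
        return (0, -tag["article_citations"], position)

    return [case_id for _, case_id in sorted(enumerate(results), key=key)]

# Invented search results and invented metadata tags.
results = ["c3", "c1", "c4", "c2"]
leading_tags = {
    "c1": {"article_citations": 12},
    "c2": {"article_citations": 40},
}

ordered = sort_results(results, leading_tags)
print(ordered)
```

The underlying database is untouched; only the order in which results are presented changes, so the rest of the corpus remains one click away.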

The goal of the project is to create a new way of using online digital legal materials. New technologies have allowed us to think of combining information in ways that were unheard of, even unthinkable, before today. “Leading Case Service” is essentially a ‘mash-up’ of online case law databases and online databases of law review articles. To this mash-up, colleagues have suggested that we may be able to add blogs, digital commons, wire-services, websites, legal periodical indexes and possibly treatises. The use of this information is not merely academic. It may also prove to be a way to power new search engines or discover new ways that various parts of the conceptual, scholarly world of the law influence each other.

Sunday, March 07, 2010

Open Access Plus

The Fourth Rail of the Digital Revolution in Legal Materials

Much good work is being done to ensure that as the internet develops and digital information becomes the norm, it remains freely accessible to all citizens. After all, how can citizens participate in their government if they can't have access to their own laws? Efforts by AALL, PublicResource.org, NCCUSL and others are focused primarily on making sure that all government and primary legal materials are free, reliable and authentic. Again, how can citizens participate in their government if cost limits their access and they can't be assured that what they are accessing is the real thing? Law.gov, NCCUSL and AALL's Washington Affairs Office are working hard on all fronts, known collectively as "access, authentication and preservation."

I want to discuss the all but overlooked aspect of the digital revolution in legal materials: meaningful access to the law. If we think of access, authentication and preservation as three legs upon which the ideals of "open access" stand, meaningful access as described below would constitute the fourth leg, without which all the access in the world may not be enough to truly address the needs of American citizens.

We librarians know good and well that the key to efficient, effective legal research is not finding cases and statutes. Rather, a skilled researcher knows which tools lead you to the right statutes and cases, and, preferably, especially if you're new to the subject, tools that also explain what the 'law' of that subject is. In this context, the 'law' is not merely a rule but a series of calculations and interpretations about what all the cases and statutes (and politicians and society in general?) say about the subject, and the standards of practice or behavior that result.

The debate about free, unfettered access to primary legal materials is, therefore, something of a red herring. Access to the primary law is really secondary if the goal is to give citizens free, unfettered access to the 'law.' In this context, practical knowledge of the law can be described as the ability to predict outcomes of lawsuits, relational expectations or legal proceedings. This knowledge causes people to live and pursue livelihoods in accordance with legal standards.

As the body of primary legal materials grows and access to it spreads, what will be the result? Will citizens actually be better able to understand the law without access to the scholarship, analysis and the sophisticated objective finding tools of legal research?

In addition to advocating the free, unfettered access to primary laws, perhaps we should also focus our efforts toward using new technology to develop new finding tools and access to secondary materials.

I propose that the internet provides us with the means to create aggregated, federated meta-search engines that could mine legal scholarship and commentary found in emerging web-based resources such as digital commons, blogs, news and RSS feeds, Twitter feeds, podcasts, etc. We librarians are in a unique position to understand the "informatiosphere": how it's structured, and how to evaluate authenticity, authority and the 'new' provenance. There are many ways that search engines and search algorithms may be designed to provide access to new, free materials that make access to the law more useful and, contrary to the prevailing commercial model, encourage the development of more free materials.

And herein lies the rub. One byproduct of the coming 'digital age' is the ability of commercial publishers to closely regulate access to various information sources. Commercial legal publishers' products rarely have value exclusively in the publication of primary legal materials. The value that commercial legal publishers offer lawyers and lay people interested in learning about the law lies in their secondary materials and finding tools. As open free access to primary materials becomes the norm, legal publishers will likely tighten the circle around their proprietary commercial products. As their income declines from the sale of primary materials, which most also publish in addition to secondary resources, these corporations will make up the difference by increasing the prices of finding tools, treatises, form books, looseleaf reporters, etc. As the print versions of these secondary resources disappear from library shelves, access to them by lay people will be all but blocked because most cannot afford access to online products produced by the major legal publishers.

When efforts to make access to primary legal materials free succeed, it is possible that only legal professionals will have access to commercially produced finding tools and secondary materials. As described earlier, these may actually be the most important materials to which people interested in learning the law must have access in order to equip them to make reasoned, legal decisions about their lives and livelihoods.

Should this come to pass, if we fail to provide ordinary citizens with access to some form of secondary materials that help them find and understand the law, our success in providing them with free, unfettered access to primary materials may, in the end, be a Pyrrhic victory.

Tuesday, February 16, 2010

Some Issues Answered: West Explains and Raises questions....

3 Geeks and a Law Blog: WestlawNext - Some Issues Answered published an email that Anne Ellis, Senior Director, Librarian Relations, at West, distributed to many AALL listservs this week.

Just beneath the surface of all the hubbub surrounding the roll-out of WestlawNext (WLN) is an unanswered question regarding the structure and nature of the new search engine. West doesn't seem to be very forthcoming about what it is other than to say that it is more than just a new interface on the same old product. It is, apparently, more than simply new window dressing on WIN. It is also more than simply taking searches, analyzing them and then searching through West's vast universe of secondary materials. There is an aspect of the searching process (dare we call it "algorithm"?), apparently, where the users themselves actually contribute to the ranking/value of specific documents in Westlaw's database, be they primary or secondary law.

Indeed. This is essentially how Google has built its search engine hegemony. Essentially, users "vote" for results with their clicks. (Of course Google makes money by selling votes to businesses that want to be top of any search list. The ramifications for law make one think of a Grisham novel....) Is this really what WLN is all about? Is crowd-sourcing the law really good for the law? For researchers?

I wonder.

Sunday, December 06, 2009

Chat Room Transcript from 4 December 2009 BlogTalkRadio Show

I will be writing later about our conversation with Anurag Acharya, Chief Engineer of Google Scholar. Greg Lambert, Roger Skalbeck, Marcia Dority Baker and I had a wonderful 90-minute conversation with Mr. Acharya, and I think that I speak for us all when I say that the conversation was not only enlightening, but we were all very impressed with Mr. Acharya's charm, his sense of humor and the great delight he has in his work. Like I said, more on that later. For the time being, please feel free to peruse the transcript of the chat room. And, as always, remember that you can download the show on iTunes or from the show's website: http://blogtalkradio.com/thelawlibrarian.

A couple of notes about the show, for anyone interested: we had a record of 301 live listeners and 105 people in the chat room! Over the weekend, there have already been nearly 140 downloads of the show. Thanks to everyone who participated. We've a lot of exciting shows planned for 2010. We'll be taking the holidays off, but plan to return on January 15. At that time, we'll begin our new schedule of recording/airing twice a month on the first and third Fridays of the month.

Until then, everyone, please have a safe and happy holiday season. And for all the crew at The Law Librarian on BTR, we'd love to hear your ideas and thoughts about how we may improve.

Tuesday, November 17, 2009

Google Scholar - (Almost) Great Free Legal Search

Amazing. Google has made a giant step toward creating a practical search engine of legal materials. Click on the link above to check it out. Google's new Legal Opinions and Journals (LOJ) is not a Wexis, or VHPPLM killer. It is a game changer in the "free law" community.

Here are a few initial comments about it. First, it is still classically a Google product. By this I mean that they spend little time working on user interface. It is what it is. We tend to forgive Google for all its faults because it simply has little competition, and it's so quick and easy to use that the annoyances of the way the search results and options are presented to you are forgiven. It's quick and easy. Forget the clutter.

Second, it's amazingly snappy. Searches on any topic I threw at it, in any combination of databases were returned in the blink of an eye.

Third, the "How Cited," tab is fascinating and provides quick access to raw citation information on the case. Like everything Google, there's little help distinguishing one cite from another, but there is help. And the information provided is good. The speed of the Google engine can make drilling down to particulars very quick - even if it means that you have to wade through hundreds of cases. Clicking from case to "How Cited" tab, to case, one can quickly get lost, but if you keep your wits about you, you can learn some interesting things about the case you're researching.

Fourth, it is unclear just exactly what you're searching when you use Google Scholar's Legal Opinions and Journals (GSLOJ). When you click on the Advanced Search link, you get choices of "Search all legal opinions and journals," or searching only Federal opinions or individual state court opinions. State court opinions can be searched in any combination, just by clicking boxes and selecting the states that you want to search. Trouble is, there is no description of what library of journals is being searched, or what the years of coverage are for the case databases. Do the Nebraska cases, for example, go back ten years, twenty, or two? It's hard to say.

Fifth, there are no statutes or regulations to be found in/on LOJ. What's with that?!

Sixth, you can't search only Law Journals. With the growing movement to develop digital commons, and to move law reviews to the web, it would be immensely helpful to be able to mine this vein of secondary material.

Overall, Google Scholar's new LOJ is a welcome entry into the free online legal research community. I don't think that West or Lexis have much to worry about, but LII, Justia, et al., may have "competition."

What impact will this have on Law.Gov, "Free Law," and kerfuffles? This is certainly a game changer.

For the Official Explanation: http://googleblog.blogspot.com/2009/11/finding-laws

Monday, November 16, 2009

New Concept in Database Search Engines

I have been thinking about this concept for about a year, and I can't get it out of my head. It's time to share it. I hope that Google, CCH or BNA reads it, exploits it and sends me a hefty check....

Why haven't online legal database providers figured out that online databases are a new breed of legal research tool and developed something completely different? To date, all online databases are not much more than online versions of their old-fashioned print tools. There are differences, of course: online searching allows users to find particular cases and documents quickly, sort rapidly and print more cleanly. In reality, though, online tools do no more than allow users to skate around through masses of undifferentiated primary law, using cite-verification tools to sift through the mass of material fairly quickly, but without much help or guidance.

I propose development of a new kind of online search engine. To begin, let's establish a few assumptions. First, let's presume that cases cited by treatise, law review and blog writers and commentators are more important than cases that are not cited by these writers. Second, let's presume that cases cited more frequently are more important than less cited cases. Third, it is possible to make assumptions about the relative value of a case based upon the kinds of works a case is cited in, as well as the kind of treatment that a case receives in that work.

Based upon these three assumptions, I think that it is possible to develop a database(s) that is comprised of only cited cases. What's more, meta-data can be created that will note where it was cited, and the level of treatment.
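As a rough illustration of how the three assumptions could turn into a database, here is a minimal Python sketch. The source types, treatment levels and numeric weights are all invented for the example, as is the build_database helper; choosing real weights would be a research question in itself.

```python
# Hypothetical weights for the kind of work doing the citing (assumption three).
SOURCE_WEIGHT = {"treatise": 3, "law_review": 2, "blog": 1}
# Hypothetical weights for the level of treatment the case receives.
TREATMENT_WEIGHT = {"discussed": 2, "cited": 1}

def build_database(citations):
    """Build a database of cited cases only, with meta-data on each citation.

    Each citation is a (case, source_type, treatment) tuple. Uncited cases
    never enter the database (assumption one), and every extra citation
    raises the score (assumption two).
    """
    database = {}
    for case, source_type, treatment in citations:
        entry = database.setdefault(case, {"score": 0, "cited_in": []})
        entry["score"] += SOURCE_WEIGHT[source_type] * TREATMENT_WEIGHT[treatment]
        entry["cited_in"].append((source_type, treatment))
    return database

# Invented citation records.
citations = [
    ("Case X", "treatise", "discussed"),
    ("Case X", "blog", "cited"),
    ("Case Y", "law_review", "cited"),
]

database = build_database(citations)
print(database["Case X"])
```

The cited_in list is the meta-data mentioned above: it records where each case was cited and how it was treated, so a search engine could filter or sort on either.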

There are at least six great sources from which you can build such databases. West has perhaps the greatest library from which to build such a database. Its collection of secondary materials is tremendous. Lexis is also well-positioned to accomplish something like this with its Matthew Bender titles. But perhaps the two companies best equipped to build such a high performance database are CCH and BNA. These companies own some of the very best specialized law treatises. It's nice for these companies to put their newsletters and looseleafs in electronic format, but, to paraphrase early library automation consultants, "an electronic version of a good looseleaf only creates a good electronic looseleaf." In other words, it doesn't make a good thing better; it only makes it electronic. In order to make a good thing great, it must be different. (That should be obvious, but somehow it's not….)

But what if you're not West, Lexis, BNA or CCH? Are you out of luck? I don't think so. There are two resources left. First, Hein Online is now comprised of an unprecedented collection of law reviews. This is a vast gold mine of notable cases. Hein itself could develop a search engine that sifts through the very best cases based on citation frequency among law review writers.

Digital commons and blogs are a newly emerging resource that may accomplish roughly the same thing. Looking forward, a crawler could be designed that will crawl through digital commons, legal blogs and law review websites looking for cited cases. Here, the presumption is that cases that are discussed by more writers are more significant.
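The crawling itself is beyond a short example, but the step that looks for cited cases in a page of text can be sketched. The Python below assumes the page text has already been fetched, and it uses a deliberately narrow pattern that recognizes only a few reporter abbreviations (U.S., F.2d, F.3d); a real tool would need far broader coverage of citation formats. The names CITATION_RE, extract_citations and count_citing_pages are made up for illustration.

```python
import re
from collections import Counter

# Narrow, illustrative pattern: volume, reporter abbreviation, first page.
CITATION_RE = re.compile(r"\b(\d{1,3})\s+(U\.S\.|F\.2d|F\.3d)\s+(\d{1,4})\b")

def extract_citations(text):
    """Return every reporter citation found in the text, in order."""
    return [
        f"{volume} {reporter} {page}"
        for volume, reporter, page in CITATION_RE.findall(text)
    ]

def count_citing_pages(pages):
    """Count, for each citation, how many pages mention it at least once."""
    counts = Counter()
    for text in pages.values():
        counts.update(set(extract_citations(text)))
    return counts

# Invented page texts standing in for crawled blog posts and articles.
pages = {
    "blog-post-1": "Brown v. Board of Education, 347 U.S. 483 (1954), "
                   "is discussed at length; see again 347 U.S. 483.",
    "article-2": "Compare 347 U.S. 483 with 410 U.S. 113.",
}

counts = count_citing_pages(pages)
print(counts.most_common())
```

Counting each page only once per citation matches the presumption above: what matters is how many writers discuss a case, not how often one writer repeats it.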

Finally, it is possible that such a database could be made simply from cases cited by other cases. It can be presumed that the cases cited most frequently by other cases are the more significant legal precedents.
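Counting how often cases cite one another is the simplest of these ideas to sketch. The Python below assumes an invented citation graph in which each case lists the earlier cases it cites; times_cited is a hypothetical helper.

```python
from collections import Counter

# Invented citation graph: each case maps to the earlier cases it cites.
cites = {
    "Case D": ["Case A", "Case B"],
    "Case C": ["Case A"],
    "Case B": ["Case A"],
    "Case A": [],
}

def times_cited(cites):
    """Count how many distinct cases cite each case."""
    counts = Counter({case: 0 for case in cites})
    for citing, cited_list in cites.items():
        for cited in set(cited_list):
            if cited != citing:  # ignore any self-citation
                counts[cited] += 1
    return counts

counts = times_cited(cites)
# Most-cited first; ties broken alphabetically so the order is stable.
most_cited = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(most_cited)
```

A raw count like this favors older cases, which have had more time to be cited, so in practice it would probably need to be combined with the other signals discussed above.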