In 1992 the Advanced Research Projects Agency (ARPA) funded a three-year grant to
investigate questions related to large-scale, distributed digital libraries. The award
focused research on Computer Science Technical Reports (CS-TR) and was granted to the
Corporation for National Research Initiatives (CNRI) and five research universities.
The ensuing collaborative research has focused on a broad spectrum of technical, social,
and legal issues, and has encompassed all aspects of a very large, heterogeneous
distributed digital library environment: acquisition, storage, organization, search,
retrieval, display, use and intellectual property. The initial corpus of this digital
library is a coherent digital collection of CS-TRs created at the five participating
universities: Carnegie Mellon University, Cornell University, Massachusetts Institute of
Technology, Stanford University, and the University of California at Berkeley. The
Corporation for National Research Initiatives serves as a collaborator and agent for the
project.
If we accept that we are living in the Information Age and that a
challenge for this age is to give people tools with which they can successfully use
networked information, then librarians and computer scientists are natural
collaborators to address this challenge. Computer scientists and librarians each bring to
the discussion complementary technical skills and perspectives. Computer scientists
have a large view of the network, new approaches to information retrieval, and an
openness to change. Librarians have content, and a historical, enduring view regarding
service and responsibilities for our intellectual heritage. Both communities share the
academic values of sharing openly and the desire to foster the creation of new, more
powerful knowledge. In this project, the librarians have benefited from the computer
scientist’s cultural value of exploration and learning by doing. The computer scientists
have benefited from the librarian’s broad perspective and integrative skills. The
coupling of content and carrier, scale, inter-operability, and mutual respect for
professional knowledge and abilities has served to create a productive, dynamic
atmosphere.
The project testbed supports both service and ongoing
experimentation. While the prototype service is available now for public use, the
testbed and its services are also continuously changing. This CS-TR project highlights
the tension between providing reliable services while experimenting with new
capabilities. Moving into the future and contributing to new arenas of digital
information while maintaining perspective and providing daily services are challenges
for individual librarians and innovative library organizations. In the CS-TR project,
librarians have continuously examined the long term viability of the effort. At each
stage of the effort, it has been important to remember the research nature of the project
and that digital libraries are in their nascent state. Whatever we build today will be
superseded by more powerful knowledge and services in the future.
Discussions for the CS-TR project began in 1990 and evolved finally into the structure
in place today. The original question posed for the project was straightforward: how
can we make computer science technical reports more accessible to researchers?
Computer Science Technical Reports are an important body of knowledge, but they are often
difficult to locate because they are normally published by academic or research
departments, and we believed that the intellectual property issues were not terribly
complex. Through the early discussions among the participating institutions, the horizon
of the issues expanded and this broadened view was presented to potential funding
agencies. With ARPA funding in 1992 and CNRI's role as contract administrator, it
became apparent that we had the potential to set the pace for several important pieces of
the digital library: distributed, virtual collections spread across the network,
development of sophisticated linking mechanisms that would enable the location and
retrieval of information no matter where it is located, incorporation of mechanisms to handle
intellectual property issues in a digital environment, and, finally, a better understanding of all of these issues.
The project’s core design is based upon the construction of a bibliographic records
database that describes the TR’s and enables linkage to the page images of those TR’s. The
concept of the database has been debated over the course of the project: should it be
centralized and replicated at each site, or should it be distributed, with each site
maintaining the index records only for its own collection? The nature of the linking
mechanism between the record and the images has been a topic of lively discussion and
development. We must assume that the TR bibliographic record will be stored in a
different location from the page images and that both the records and the images may
move to other machines during their lifetimes. What linking mechanism will support
this location flexibility and maintain high, efficient performance?
In addition to images, project staff also experimented with the full text of the TR’s,
obtained from the source files of the TR’s or through OCR techniques applied to the images.
Together, these files will enable exploration and evaluation of full-text retrieval
mechanisms.
The group discussed alternatives. We wanted a simple format, for people and for
machines; one that was easy to read (“human readable”) and easy to create. (These
bibliographic records are usually produced by secretaries or publications
coordinators). We knew we were possibly choosing an interim format as automatic and
full-text indexing methods may supersede bibliographic records.
Using USMARC (US Machine Readable Cataloging), prevalent in library cataloging
practice, was considered early in the project and discarded. USMARC is very complex; it is
not easily taught, nor is it accepted by non-catalogers. Project staff were concerned that
the complexity and the high level of training necessary to catalog in USMARC might cause
significant delays between TR publication and bibliographic record creation. For this CS-TR
project, the possibility of such a delay was unacceptable.
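The format the group settled on is a simple tagged, human-readable layout in the style of RFC 1807 ("FIELD:: value" lines). As a minimal sketch, here is an invented sample record and a small parser; the field names follow RFC 1807, but the values and the parser itself are illustrative, not part of the project's software:

```python
# Minimal sketch: parse a tagged bibliographic record of the
# "FIELD:: value" form used by RFC 1807. The sample record is
# invented for illustration; the field names follow the RFC.
SAMPLE = """\
BIB-VERSION:: CS-TR-v2.1
ID:: STAN//CS-TR-95-0000
ENTRY:: January 1, 1995
TITLE:: An Example Technical Report
AUTHOR:: Doe, Jane
DATE:: December 1994
END:: STAN//CS-TR-95-0000
"""

def parse_record(text):
    """Return a dict mapping field names to lists of values."""
    record = {}
    for line in text.splitlines():
        if "::" not in line:
            continue  # this sketch ignores blank and continuation lines
        field, _, value = line.partition("::")
        record.setdefault(field.strip(), []).append(value.strip())
    return record

rec = parse_record(SAMPLE)
print(rec["TITLE"])  # ['An Example Technical Report']
```

Because the layout is plain "field:: value" text, a publications coordinator can type a record in any editor, which is exactly the ease-of-creation property the group wanted.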
Once the bibliographic record format was created the discussion turned to centralized vs.
distributed indexes. Long conversations ensued where the participants argued the
virtues, value, and scalability of centralized and/or decentralized indexes for very large
distributed collections.
One of the early goals of the project was to develop an inter-operable, distributed
collection whereby each site would develop its own testbed architecture, create
consistent content based on the G4-tiffb standard, and then experiment with
interoperating and sharing those collections across different systems. In the end no
conclusions were reached and the above goal was not met. We know that neither
centralized nor decentralized servers will scale. Eventually a more complicated, yet-to-be-determined
architecture may emerge which will involve replication of an
institution's indexes on several servers around the country. This effort will require
more research and a great deal of cooperation between institutions.
In order to get started, Cornell developed Dienst, a protocol and
implementation that provides Internet access to our distributed collections. The indexes
are produced and kept at each institution, and each institution is required to run a
Dienst server.
In the Dienst architecture there are four classes of services. A Repository Service
stores digital documents, each of which has a unique name and may exist in several
different formats. An Index Service searches a collection and returns a list of
documents that match the search. A single, centralized Meta Service (also called a
Contact Service) provides a directory of locations of all other services. Finally, a User
Interface service mediates human access to this library. All these services communicate
via the Dienst protocol. [Dienst Web page at http://www.ncstrl.org]
A group of sites sharing the Dienst protocol form a single distributed collection. Each
site will typically run repository, index, and UI services for documents issued by that
site. One of the sites will run a Meta service, thus defining the set of sites that make up
the collection.
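The four service classes above can be sketched as plain objects. This is an illustrative model only, with invented names and interfaces, not the real Dienst protocol (which the services speak over the network):

```python
# Illustrative sketch (not the real Dienst API) of the four Dienst
# service classes and how a search flows between them.
class Repository:
    """Stores documents under unique names, possibly in several formats."""
    def __init__(self):
        self.docs = {}
    def store(self, name, fmt, data):
        self.docs[(name, fmt)] = data
    def retrieve(self, name, fmt):
        return self.docs[(name, fmt)]

class Index:
    """Searches one site's collection, returning matching document names."""
    def __init__(self, repo):
        self.repo = repo
    def search(self, term):
        return [name for (name, fmt), data in self.repo.docs.items()
                if term in data]

class Meta:
    """Single centralized directory of the services in the collection."""
    def __init__(self):
        self.indexes = []
    def register(self, index):
        self.indexes.append(index)

class UserInterface:
    """Mediates human access: queries every index the Meta service knows."""
    def __init__(self, meta):
        self.meta = meta
    def search(self, term):
        hits = []
        for idx in self.meta.indexes:
            hits.extend(idx.search(term))
        return hits

# One site's services; a real collection spans many hosts, with the
# Meta service defining which sites belong to it.
repo = Repository()
repo.store("SITE//TR-95-01", "tiff", "report on distributed indexing")
meta = Meta()
meta.register(Index(repo))
ui = UserInterface(meta)
print(ui.search("indexing"))  # ['SITE//TR-95-01']
```

Note how the Meta service is the only centralized piece: adding a site to the collection means registering its index with the Meta service, while documents and indexes stay at the site that issued them.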
The pros and cons of a standardized format (images, SGML, PostScript, ASCII) for the
technical report documents were vigorously debated; the outcome? TIFF-B image format
(also called Group IV FAX compression in TIFF format) was selected as the project
standard. This decision was supported by at least the following factors: in 1992, image
formats were standard and many commercial software packages were available on
multiple platforms; retrospective paper reports could be converted to image format;
project participants were eager to populate servers with both retrospective and
prospective reports; and researchers did NOT want to spend resources on document mark
up, document conversion or on developing new standards. Two faculty members of the
consortium believe that images should remain the ultimate version of record because
they provide the simplest exact representation of the document and can be exported to
new software and platforms over time. In brief, the consortium chose to try to
populate the architectures with content rather than trying to solve the issue of how to
format the content.
The MIT Library 2000 testbed effort focused significant attention on production
scanning. This emphasis is based upon the hypothesis that scanned images of documents
will be an important component of any future electronic environment. At its core, the
digital library must contain high quality content, and, for the foreseeable future, much
of that content will come from the conversion of paper format information to scanned
images. Further, the creation of a large corpus of quality information provides the
testbed content for investigations into system architectures, electronic information
management, retrieval, and long-term storage. Basic principles of the MIT scanning
effort include:
1. Materials should only be handled once. The design of the scanning
environment should strive to achieve the greatest advantage in terms of price,
performance, and quality. Libraries and publishers cannot afford to re-scan materials
as technological capability increases. For the original paper artifact, scanning once is
also preferable. To adhere to this principle, good paper workflow, management, and
content selection are important.
2. Scanning should capture as much information as possible in the single scan.
Current technology cannot exploit all of the bits captured; future
technologies, however, will be able to exploit all nuances of the captured information.
The MIT scanners are capable of a resolution of 400 pixels per inch, with eight bits of
gray-scale per pixel. These create very large files (about 16 Mbytes per scanned
page), which are rendered down to the agreed-upon interchange format for the project:
300 dpi, one bit per pixel, in CCITT Group IV FAX compression in TIFF format.
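The sizes quoted above can be checked with quick arithmetic; a minimal sketch, assuming a US-letter (8.5 by 11 inch) page, which the source does not state explicitly:

```python
# Back-of-the-envelope check of the file sizes quoted above,
# assuming a US-letter (8.5 x 11 inch) page.
WIDTH_IN, HEIGHT_IN = 8.5, 11.0

def raw_bytes(dpi, bits_per_pixel):
    """Uncompressed size of one page scanned at the given settings."""
    pixels = (WIDTH_IN * dpi) * (HEIGHT_IN * dpi)
    return pixels * bits_per_pixel / 8

scan = raw_bytes(400, 8)         # as captured by the MIT scanners
interchange = raw_bytes(300, 1)  # before Group IV compression

print(f"400 ppi, 8-bit gray: {scan / 1e6:.1f} MB")      # ~15.0 MB
print(f"300 dpi, 1-bit:      {interchange / 1e6:.1f} MB")
```

The ~15 MB result matches the "about 16 Mbytes per scanned page" figure above, and the roughly 1 MB one-bit page shrinks much further once Group IV compression is applied.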
3. Quality control is critical. In order to achieve the first two principles,
quality control methods must assure a high degree of integrity and confidence in the
production environment. The MIT Libraries’ Document Services has adapted procedures
from its micro-reproduction heritage for this new production scanning effort.
4. Context of the images is important now and in the future. Because the
underlying technologies will change and improve in the future, the CS-TR scanned
images must provide enough context for humans and machines to understand both their
content and structure in order to use them effectively. The MIT scanning effort has
created a metadata record to provide information about the scanned document and the
environment in which it was created. This record specifies both the form and content of
the information that must be captured when a document is scanned, and becomes a
component of the scanned form of the document. The record assists in viewing,
displaying, or printing the image correctly; in understanding how to interpret the
image; and in meeting contractual or legal requirements.
As a final note, this scanning effort required integral coordination and collaboration
between an operational unit of the library and the computer science research group. The
array of investigations, findings, and new questions has opened new paths for ongoing
work.
The most important concept in the Kahn/Wilensky paper is the creation of a “handle” or
a permanent unique identifier for every document. The “handle” is used to name the
document on a server. A mechanism called a “handle server” maps the permanent
unique identifier to the machine address. A working prototype of the “handle server” is
available at CNRI and the “handle” functionality is being integrated into WWW browsers.
Once browsers like Netscape and Mosaic know how to deal with “handles,” a user/client
with a unique identifier will be able to send a message to the “handle server,” which will
know on which document server the document resides. The client will then go to the
document server using the URL or machine address. Unique permanent identifiers that
are known worldwide and automatically map to machine addresses using “handle
servers” will be a very powerful tool for digital libraries. No longer will Web servers
contain broken links, because the “handle servers” can be updated nightly.
The “handle” concept seeks to separate naming issues from location/address issues.
Handles are not URLs; handles are an approach to a large-scale problem of naming
objects that may change location over time. For libraries it is an important strategic
and intellectual advance to be able to distinguish the name of an object from its address.
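The naming-versus-location separation can be sketched in a few lines. The registry, handle syntax, and function names below are invented for illustration; the real resolution service is CNRI's handle server:

```python
# Sketch of the handle idea: a permanent name that survives relocation.
# Registry, handle syntax, and API are invented for illustration.
registry = {}  # handle -> current machine address (URL)

def register(handle, url):
    """Record (or update) where the named document currently lives."""
    registry[handle] = url

def resolve(handle):
    """Map a permanent name to wherever the document lives today."""
    return registry[handle]

register("cnri.dlib/cs-tr-95-01",
         "http://old-host.example.edu/tr95-01.tiff")
# The document moves to a new machine; the handle stays the same,
# only the mapping behind it changes.
register("cnri.dlib/cs-tr-95-01",
         "http://new-host.example.edu/tr95-01.tiff")
print(resolve("cnri.dlib/cs-tr-95-01"))
```

Every citation and link carries only the handle, so relocating a document requires one update at the resolution service rather than edits to every page that points at it.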
Intellectual Property is a fundamental issue in building digital libraries. At the
beginning of this project, participants assumed that there would be few or no copyright
issues with the Technical Reports. They assumed that the reports published at their
schools were either in the public domain or that the rights were held by the University.
Later, as questions arose, the group assumed that a single strategy would work for each
institution. This proved to be naive. Upon investigation with legal counsel, researchers
discovered that each school treated its intellectual property differently, and so five
different approaches evolved.
At Stanford, the librarians took on the role of ensuring that these IP challenges did not
pose a risk to the University or to the Faculty. We identified scenarios that needed
attention and began to meet with legal counsel to determine appropriate responses. These
efforts helped us to articulate a set of intellectual property management models now used
by the CS-TR projects at Stanford and Cornell. We encourage other schools to use these
points to form guidelines for themselves. This is NOT legal advice. Every institution
must rely on the advice of its own counsel. The worldwide legal environment is
undergoing change and our current approach may become obsolete in the face of new laws
and treaties.
• If an author has signed or plans to sign an exclusive agreement with a publisher for a
particular work (or for substantially the same work) in a particular format, that
author cannot then sign a non-exclusive agreement with the institution for the same
work in the same format.
• If an author signs a non-exclusive agreement with an institution for a technical report
and then decides to publish the same work elsewhere, the author should inform the
publisher of this previous agreement. The author should then grant the Server
Management written permission for the non-exclusive rights to publish, perform, and
display the works before any works are loaded. If the author indicates s/he has already
signed an exclusive agreement with a publisher, the technical report should probably
not be mounted on the server without permission of that publisher.
• At some institutions the authors do not own the rights to their works. Each institution
should be clear about copyright ownership before mounting technical reports on
servers. The CS-TR group did not address the issue of third party rights in technical
reports. When authors sign agreements it is assumed that the entire work is original or
that the author has the rights to include non-original tables, charts, figures, etc. This
is one area that could be pursued by asking authors specifically about the originality of
their works.
The CS-TR collaboration consisted of long discussions and compromises, which created a
system that is more logical than it would have been without the collaboration. However,
collaborations of this kind create tensions and what Leigh Star refers to as double-bind
situations (Star, 1995). Each institution was funded mainly to do research in specific
areas of digital libraries. Each institution wanted to populate their servers quickly in
order to get on with the research. In addition, all researchers wanted to get their
systems used by as many people as possible. The products of the individual institutions
and the collaborations have been quite successful. Lycos has thousands of accesses every
day. SIFT also has over 10,000 subscribers. The Dienst CS library now has 14
institutions using it as the production system to disseminate their technical reports.
The prototype system is now being used in production. But now, enhancements and
changes to the system are problematic; each time the Dienst code is upgraded, individual
institutions have to work to implement the new system.
A dynamic outcome of this tension between research/prototyping and operations is the
momentum to address the research questions embedded in this topic. For example, there
are key research questions regarding distributed scale and the linking of digital objects.
Since the inception of the CS-TR project, librarians have worked in a collaborative
atmosphere with computer scientists. Both groups of participants brought strengths to
the project, and the cooperative results are superior to those if either group had
conducted the project alone. Through ongoing discussions and consideration of common
problems, such as the proposed handle mechanism, an atmosphere of trust and respect
was created. The computer scientists were respectful of the librarians’ concerns
regarding the ongoing sustainability and operation of the service when the project
funding ended, and the librarians gained greater insight into and admiration for the
innovation and “can do” spirit of the computer scientists.
For example, early in the design stage of the project, the appropriate structure and
creation of the index record for the TR’s was a key discussion topic. The computer
scientists had expectations for fast and easy creation by a variety of staff or by
technology, and the librarians argued for consistent record content and the flexibility
of multiple uses of the index record. The result, RFC 1807, accommodated both
requirements in a sustainable, scalable manner. The records can be created by
publishing assistants and can be created immediately upon acceptance of the TR. The
records are also distinguished by consistent definition and use of the record fields, and
conversion routines are in place to facilitate MARC record creation or use of the record
in other formats. Another example is the collaboration of document service staff in the
MIT Libraries with researchers to create an operational scanning service to convert
TR’s to page image form and to create the process and mechanism to accommodate massive
amounts of information.
Doctoral-granting U.S. institutions in computer science are invited to participate.
Other institutions of higher education or commercial or government research
laboratories who wish to participate should contact Rebecca Lasher, Computer Science
Librarian (rlasher@forsythe.stanford.edu) to inquire about their possible involvement.
Before beginning to participate, institutions should evaluate their resources and
commitment to this project. It is anticipated that this project will continue as an
ongoing, operational service which will expand in content and participants even after the
CS-TR project concludes. Therefore, institutions should only join if they feel they will
be able to maintain their commitment over the long term.
At the June 1995 CS-TR meeting, the group agreed to ask the Computing Research
Association (CRA) to endorse and to encourage the proliferation of this technology. A
new consortium effort, NCSTRL (Networked Computer Science Technical Report
Library), merges two earlier systems: the ARPA-sponsored CS-TR project
and WATERS (Wide Area Technical Report Service), which was sponsored by the National
Science Foundation (NSF). This new effort will continue to contribute to the broader
Digital Library community.
Libraries are operational, production-oriented service organizations. The librarian’s
evaluation of research tends to focus on how successfully the products of research are
integrated with or replace existing services; and how well they can be supported and
renewed in a production environment. The CS-TR project has built several new
prototypes, and they must now be extended into a production environment. It may be
useful to think of the CS-TR project as beginning to address some of the key
investigations for system design processes:
1. Discovery: matching the technology with the service vision.
2. Delivery: nurturing and developing this match in a prototype atmosphere to examine
its feasibility and readiness for implementation.
3. Service: the ongoing operations of the service; continuous improvement of the service.
4. Support: provision of assistance, documentation, training, etc.
5. Integration: fit of the new service with the organization’s overall architecture and
services.
The results obtained by the CS-TR consortium provide a model of a working distributed
digital library. These results will be useful for launching the new Joint Initiative DL
Projects and as the conceptual framework for further research. Beyond the current
CS-TR effort, we believe that the CS-TR Consortium could also continue to contribute to
the broader Digital Library community (Lynch, Garcia-Molina, 1995).
From the librarian’s perspective, the CS-TR project offered the opportunity to work
with and contribute to a world-class effort to transform scholarly communication. The
learning experience was intense and gratifying. More questions have been formulated
than were answered, but the new questions are better articulated and understood. The
foundation laid by the CS-TR has immediate benefits and long term viability. We should
note that we continue to evolve a definition of “digital library.” One of the questions is
whether “digital library” is a real library - as we might define a library today - or
whether the phrase is a metaphor for something entirely different. This report is a
small step towards publicizing and presenting these CS-TR findings for broader
dissemination and discussion in the Library community.
This work was sponsored in part by the Corporation for National Research Initiatives,
using funds from the Advanced Research Projects Agency of the United States Department
of Defense under CNRI’s grant No. MDA-972-92-J-1029. The views and conclusions
contained in this document are those of the authors and should not be interpreted as
necessarily representing the official policies or endorsements, either expressed or
implied, of ARPA, the U.S. Government, or CNRI.