The Cathologuer

Automata in the library

More musings on the role of digital technology in the library, in response to my course at #citylis.

The following is my third and final reflection on my #citylis course on ‘Digital Information Technologies and Architectures’.

The world has turned and we find ourselves coming to the end of the first semester (already!) here at #citylis, and therefore also to the end of our course on digital technology and its impact on library and information science (LIS).

Since I posted my last reflection on this topic, the course material has gone on to cover the following areas: altmetrics, a branch of bibliometric analysis which measures the impact of research in non-traditional arenas, including on social media; coding, digital text analysis and text mining, with a nod towards the related field of the digital humanities; and, finally, the rather wide spectrum of technological developments which come under the heading of artificial intelligence (AI).

When thinking recently about a way to sum up the last three weeks of the course, I was reminded of the model of the information communication chain, a concept initially proposed by Lyn Robinson in a paper from 2009 as being representative of the fundamental area of interest for researchers in LIS.

Figure: Information Communication Chain – © @lynrobinson

Above all, I started to think about the ways in which these technologies might inform our understanding of the various stages of the dissemination, sharing and management of information, and also in some instances its organisation and retrieval as well. It also came home to me just how similar they are in some ways, at least in terms of the implications they have for human agency and understanding.

Altmetrics is a good place to start, mainly because it is a clear example of digital technology being used to aid what in some ways was (and still is) the task of LIS professionals, above all subject specialist librarians. Evaluating the significance of a particular document – a journal article, blog, dataset, etc. – in this case by measuring the amount of attention it is receiving on a number of communication channels (or so-called “impact flavours” such as blog posts, mainstream news, Wikipedia articles, policy documents, discussions on social media, and so on), allows for a new perspective on the spread of information, one which promises an alternative, more nuanced picture of scholarly communication and (re)use to that of more traditional metrics focused on citations.

Attention does not necessarily correlate with quality, of course; the figures by themselves have the potential to mislead, above all when it comes to analysing the distinct patterns of reception and discussion at work in the life-cycle of a particular research output. Nevertheless, the increasing volume of scholarly literature in circulation, its growing complexity and diversity (as made clear in a 2014 report by OCLC Research on The Evolving Scholarly Record), means that some sort of indicators as to currently trending topics and contributions to ongoing discussions can but be of help both to the acquisitions librarians tasked with collections development in a certain area and to those given custody of their institution’s digital repository.

What is more, the integration of altmetrics tools (such as Altmetric and Plum Analytics, to name but two) with the APIs of web services like Twitter and Mendeley means that the task of gathering numbers of citations and sifting through the references to an article on social media can to a great extent be automated.
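To give a rough sense of what this kind of automation involves, here is a minimal sketch in Python. It does not call any real service: the list of mentions is made up for illustration, and the names `mentions` and `attention_by_source` are my own assumptions rather than part of any altmetrics tool. A real tool would gather such records from the APIs of the services concerned.

```python
from collections import Counter

# Hypothetical mention records for a single article. In practice these
# would be collected from web service APIs rather than typed in by hand.
mentions = [
    {"source": "twitter", "url": "https://example.org/t/1"},
    {"source": "twitter", "url": "https://example.org/t/2"},
    {"source": "blog", "url": "https://example.org/b/1"},
    {"source": "news", "url": "https://example.org/n/1"},
]


def attention_by_source(records):
    """Count how many mentions each communication channel contributed."""
    return Counter(record["source"] for record in records)


counts = attention_by_source(mentions)
for source, total in counts.most_common():
    print(source, total)
```

The counts say how much attention an article has received on each channel, and nothing at all about its quality.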

In a similar way, the applications of coding and AI within LIS allow both for the automation of a number of routine tasks and for otherwise overlooked insights into aspects of certain datasets (such as those recently made available by the British Library from its collections).

The two are also linked in uncanny ways. Python, one of the primary programming languages used in coding for the digital humanities, is also at the heart of several projects currently being undertaken in the field of machine learning. What is more, the techniques used in the digital humanities to perform text and data mining are likewise important parts of the related process of natural language processing (i.e. the transference of semantic, linguistically-based information contained in human speech into digital information via a series of statistical and probabilistic models). In both cases, the presence of large corpora of well-formed, ‘clean’ textual data is required.
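As an illustration of the simplest kind of text mining, the sketch below counts word frequencies in a short passage using only the Python standard library. The function name `word_frequencies` and the tiny stopword list are assumptions made for the example; real projects typically use much larger stopword lists and dedicated libraries.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, assumed for illustration only.
STOPWORDS = frozenset({"the", "of", "and", "a", "in", "to", "is"})


def word_frequencies(text, stopwords=STOPWORDS):
    """Lower-case the text, split it into words, and count the non-stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


sample = "The library is a memory institution, and the library is a gateway."
print(word_frequencies(sample).most_common(3))
```

Even a count this crude shows why 'clean' text matters: stray characters or inconsistent spelling in the source would each be counted as separate words.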

I admit, it is sometimes difficult not to focus on the disruptive elements of such technologies. The potential use of neural networks in the automatic classification and indexing of documents, to take just one example, could have an enormous effect on LIS theory and practice when it comes to information organisation. What would prevent an AI cataloguer from deciding to classify a document in a completely different way to a human, based upon different logical or statistical criteria? Would the machine even require an information organisation system that corresponded to anything a human could feasibly understand? How would this change the process of information retrieval, for instance?

And what of other tasks in the library? To name just one recent example, the advent of an AI digital legal assistant, ROSS, currently being developed in Toronto and built on the model of IBM’s Watson, could have significant impacts on the job of legal reference librarians.

On the other hand, it would be unwise for LIS to stick its proverbial head in the sand when it comes to these technologies. As they become more and more a part of the information communication landscape and a feature of everyday reality in the way in which scholarly research and other knowledge is disseminated and conducted, libraries and other stakeholders with an interest in the documented record of humankind will naturally need to find a way of incorporating these technologies into the services they provide to their users. At the same time, however, they should still be worrying about the ethical implications these and other technologies may present.

Thanks to Lyn and David, at least our cohort at #citylis should be suitably prepared for that potentiality!

Search and find (more), or, What can libraries learn from Google?

In my last post I took an admittedly rather cursory look at the ways in which the quantities of data being generated in the modern, networked society provide distinct challenges for library and information professionals. In this, my second reflection on the topic of ‘Digital Information Technologies and Architectures’, I want to take the next step and think a bit more about what this means for libraries and other cultural heritage institutions.

The question I want to ask is this. What exactly can libraries (and other cultural heritage institutions, for that matter) learn from Google?

The question might seem a little provocative, I admit. After all, the dichotomy between libraries and web-based search engines (other brands of which are of course available, but note that only Google is included as a verb in the OED) is one you hear talked about a lot, especially when questions of the relevance of library services (usually public or academic) are discussed.

For instance, when former Fox News presenter Greta Van Susteren took to Twitter earlier this month to criticise the building of expensive new library buildings on U.S. college campuses as “vanity projects”, arguing that the same information (and, indeed, much more) could instead be accessed via students’ smartphones, librarians and academics were quick to respond along the familiar lines of the debate. Libraries, they argued, offered their users immensely more in terms of value-added services than the search engines and the web – with all their potential for inadequacy and bias in the results they present – ever could.

As you may be relieved to hear, I don’t intend to delve much deeper into this particular debate in this post. For one thing, as has already been argued by Ned Potter in a piece on this topic, entitled (rather tellingly) ‘For The Last Time, Google Is Not Our Competition in Libraries’, the terms of the debate, which tend to place library services in direct rivalry with Google for the hearts and minds (or at least the reliance and curiosity) of those who seek after information, are frequently overblown. In other words, we tend to use search engines and libraries for very different things; you wouldn’t expect, after all, to use a library to find out general information about the weather, or where to buy cheap cinema tickets.

Search engines such as Google are, in some ways, an extension, a branching out into the everyday contexts of all of us who live in the networked, data-rich information society that surrounds us, of one of the most basic functions of the library as a memory institution throughout its long history: that of information retrieval (IR). Along with the other technologies which have been introduced over the past few weeks of my course at #citylis (metadata, relational databases, RDF, linked open data, APIs, etc.) they exist to help make sense of, describe, index, and get access to, data which are made available over the Internet.

Of course, the ways in which search engines locate and present their results may vary in terms of quality and trustworthiness. This is especially true when it comes to the problematic issues of differences in relevance ranking algorithms and the excessive personalisation of search results. Either of these can lead to a user simply gaining a less-than-accurate impression of the available information or – at the other end of the spectrum – to the formation of what Eli Pariser has termed ‘filter bubbles’: that is, the search engine will give us the results it thinks we want, without exposing us to different or alternative viewpoints. (Whether or not such ‘bubbles’ had anything to do with the outcome of certain recent political events is a question for another time…)

On the other hand, the speed and flexibility with which services like Google can obtain a large number of results of sufficient quality to satisfy most users has meant that Google has become, for some, nearly synonymous with the web. Language use is a case in point here: when we say “I’ll just Google it”, do we really mean any more than simply “I’ll look it up (on the Internet)”? The sheer ubiquity of Google as a default search engine in most browsers has only added to this sense of obviousness.

Perhaps there is something to be said, then, for the idea that Google (and its competitors) have taught us how to search for information in a particular way? By allowing the user to enter free-text queries into a single white box, without having to construct a series of more complex (and, admittedly, more precise) commands in an SQL-based interface, these services have seemingly cornered the market when it comes to ease and convenience.
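The contrast between the two styles of searching can be sketched with Python's built-in `sqlite3` module. Everything here is a toy example: the `books` table, its two rows, and the `free_text_search` helper are assumptions made for illustration, and no real search engine works in quite this way.

```python
import sqlite3

# A toy in-memory catalogue with an assumed three-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Introduction to Information Science", "Bawden and Robinson", 2012),
        ("The 4th Revolution", "Floridi", 2014),
    ],
)

# Structured query: precise, but the user must know the schema and the syntax.
structured = conn.execute(
    "SELECT title FROM books WHERE author LIKE ? AND year > ?",
    ("%Floridi%", 2013),
).fetchall()


def free_text_search(query):
    """Single-box search: every word must appear in the title or the author."""
    sql = "SELECT title FROM books WHERE 1=1"
    params = []
    for word in query.split():
        sql += " AND (title LIKE ? OR author LIKE ?)"
        params += [f"%{word}%", f"%{word}%"]
    return [row[0] for row in conn.execute(sql, params)]


print(structured)
print(free_text_search("floridi revolution"))
```

The single-box version asks nothing of the user beyond a few keywords, which is the convenience at issue; the price is that it cannot express a condition such as "published after 2013".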

This in turn has had an impact on expectations. Indeed, the growing use of ‘discovery systems’ such as Serials Solutions’ Summon and ExLibris’s Primo in academic library settings may be interpreted as a response to demand from students to have access to all the various resources held by the library – eBooks, bibliographic databases, digital collections, online journals, and print holdings – through one single aggregated interface. What is more, the possibilities that these systems offer to draw in metadata for resources from outside of the library’s holdings, with links to full-text publications readily available via the technologies that support linked open data (such as stable URLs, crossrefs, and so on), mean that the perception of the library as a reliable gateway to information may be reinforced (Shapiro, 2013).

I have been playing around a lot recently with Europeana, an online collection of cultural heritage data provided by institutions – libraries, museums, galleries, and archives – from around Europe and the world. All of the content on the site is published as linked open data, with good quality metadata provided for each item; the Europeana API also allows other web services to draw upon the data in the collection for their own purposes. In many ways, it represents my ideal of a digital library and, indeed, of a library discovery system in general. And it leads me to thinking. If libraries and other members of the GLAM sector are able to contribute to the standards on the web for resource description and indexing (metadata, data curation and networking, and so forth), perhaps what we really should learn from Google is how our users would prefer to search for the information on our systems?

Not drowning, but waving? LIS and data

Musings on the relationship between LIS and big data.

This post is the first in a series of reflections on my #citylis course INM348 ‘Digital Information Technologies and Architectures’.

It seems to have become something of a self-evident truth that our production and consumption of data are growing at an increasingly mind-boggling rate. The figures themselves appear to show what amounts to nearly unfathomable quantities. In one oft-quoted calculation, the computing firm IBM estimated in 2012 that 2.5 exabytes (or, in other words, 2.5 billion gigabytes) of digital data was generated every day.
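For anyone who wants to check the conversion, the arithmetic can be written out in a few lines of Python. This is only a sketch, and it assumes decimal (SI) units, in which an exabyte is 10^18 bytes and a gigabyte is 10^9 bytes; the variable names are my own.

```python
# Decimal (SI) unit sizes in bytes, assumed for this conversion.
EXABYTE = 10**18
GIGABYTE = 10**9

# 2.5 exabytes, written with integers to avoid floating-point rounding.
daily_bytes = 25 * EXABYTE // 10

daily_gigabytes = daily_bytes // GIGABYTE
print(f"{daily_gigabytes:,} gigabytes per day")
```

With binary units (where a gigabyte is 2^30 bytes) the figure would come out slightly differently, but the order of magnitude is the same.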

In addition to this, the move online for a lot of everyday activities and transactions which took place previously in the analogue sphere – such as shopping, banking, communicating with friends, finding a date, and even praying – as well as the increasing number of networked personal devices owned by any one individual (which may even rise to an average of 6.58 by 2020 [1]), means that these numbers are set to grow ever higher by the end of the decade.

Given the twin benefits of convenience and speed which such digital interactions can offer, it is perhaps not surprising that a vast majority of us – or at least those of us who inhabit the more affluent portions of today’s connected, globalised world – have voluntarily chosen to convert much of our personal, social and cultural identities into the medium of binary code, to be distributed and shared via the ubiquitous computer networks that bind our world.

In some of his recent work, the philosopher Luciano Floridi gives an interesting observation which makes clear the relevance of the data deluge to Library and Information Science (LIS) and what might be called the conventional institutions of sociocultural memory (i.e. libraries, archives, record centres, etc.). “Every day,” writes Floridi, “enough new data are being generated to fill all US libraries eight times over” [2]. Although the point he is making here is to highlight the scope of the data being generated, I find Floridi’s comment particularly useful as a starting point for my own reflections on the topic.

From one point of view, the task of collating, curating, cataloguing and preserving a large proportion of this data can and should fall to professionals in the LIS field. As the sector with historically the strongest interest, not to mention training, in storing and managing large quantities of recorded information, LIS practitioners and institutions are surely among the best placed to take on the role of making our society’s digital record accessible to future generations. Such is clearly the rationale behind projects such as the ambitious plan announced back in 2010 by the Library of Congress to archive the entirety of Twitter, for instance.

The sheer size of the quantity of data being generated, however, makes such a problem one which is beyond the scope of any particular institution or conglomerate of national public bodies. This is especially true given the inevitable bottlenecks to do with the size and cost of storage and information infrastructure, as well as the relatively limited life expectancy of current forms of digital memory hardware, at least when compared to more traditional hard copy formats (see here for a useful, albeit slightly out of date, infographic on this topic).

On the other hand, any sort of international collaboration to the degree which would be required is equally likely to be hampered by concerns about compliance, governmental policy, legal and moral ownership, and security issues. The mixed reactions to the recent hand-over of control over the internet’s domain name system (DNS) to ICANN, an international consortium, by the US government are surely rather telling in this respect.

To add to this, the need to discern which data are to be stored and made accessible (as well as how? and to whom?) only compounds the problem still further. Storing all of the records of every person on the globe is one thing; linking them together, putting them in order, cataloguing, classifying, and indexing them, all in a useful and usable manner, is quite another. And while those commentators who postulate a bright and benign future purpose for this data talk in terms of democratizing the records of society and providing a hitherto-unsurpassed resource for the historian of several generations from now, the digital record must itself be somehow refined and filtered in order for it to be properly understood in context, and to truly become knowledge [3].

Quite who is to be given this crucial role of refining the data, and the semi-editorial choices that follow concerning their relevance and place in the record, is just one of the possible points for concern which the more utopian dreams for big data seem strikingly reluctant to address. Whose data, after all, is going to be considered worth deleting, and why? And do we as individuals have any say over what of ours enters the preserved record of digital mankind?

I fear that in the world of the distributed cloud server, the equivalent of asking for one’s personal papers to be burnt after one’s death is no longer an available option, as it was in previous generations, for exerting control over the information one left behind. Then again, some of the most treasured documents in any archive collection are those which we would not have at all, had the executor of a will done their job properly.

To sum up: the role of data in LIS raises several ethical questions, some of which I look forward to exploring further during my time at #citylis.

References

[1] Evans, D., ‘The Internet of Things: How the next evolution of the Internet is changing everything’, Cisco white paper (April, 2011)

[2] Floridi, L., The 4th revolution: how the infosphere is reshaping human reality (Oxford: OUP, 2014), p. 13.

[3] For a useful introduction to the various conceptual models of the way in which data, once refined, can become information, then knowledge, then (perhaps?) wisdom, see D. Bawden and L. Robinson, Introduction to Information Science (London: Facet, 2012), pp. 73-5.

New season, new blog

Hi! My name is David, and this blog is intended to be a way of sharing my thoughts about the world of Library and Information Science (LIS), since that is the subject in which I am about to begin a master’s degree at City University London, as part of the Library School #citylis. I am excited, mainly because this is a new and welcome opportunity for me, but also because the Department (headed by Dr. Lyn Robinson and Prof. David Bawden, both of whom have excellent blogs) is renowned for its focus on the conceptual openness of the idea of the “library” and the possibilities – philosophical, technological, and ethical – that information science poses for 21st-century society.

One particular feature of the course is that it makes use of social media platforms (Twitter, blogging sites, and so on) in order to interact with, and gain feedback from, students taking the program. Having been used to a much more traditional teaching style during my time at university prior to this (I did a BA in English at Cambridge, followed by an MPhil in Medieval Studies at Oxford, if you really want to know), this should all be very new, enlightening, interesting, and ultimately very useful indeed.

There’s just one problem, and that’s this:

I am a terrible blogger.

That’s right, you heard me. I am simply terrible. Being excruciatingly awkward in most face-to-face social situations, the idea that my half-baked musings about life and other topics of interest seemingly to me alone might be read over the Internet by just about anyone in the world fills me with a certain non-negligible amount of terror. While admittedly some may see the anonymity of the online op-ed (or indeed nasty Twitter comment) as liberating, I still feel deeply embarrassed whenever I try to write blog posts, as if I were composing dead letters to a recently departed girlfriend only for them to be read by a passer-by (who, incidentally, can also instantly find out my identity).

Compounding this are problems of structure and tone. For one thing, I’ve no idea how to begin! Or how to come up with postings that a) aren’t beyond dull, yet b) appear with sufficient regularity to satisfy my followers (should I ever have any)! Perhaps all first-time bloggers feel this way, and some of them may or may not be just as vocal about it (I haven’t checked). The typical blog post seems, after all, to straddle several of the territories occupied by previous forms of writing: a potent admixture of memoir, editorial, review, letter, autobiography, diary entry and self-advertisement is often the result, stridently public yet at the same time oddly private, too.

And perhaps this formal confusion is the source of my anxiety. All previous blogs I have tried to write (all of which are, thankfully, now consigned to the recycle bin of Internet history) were far too personal in many ways. With any luck, the added impetus of having to write in order to fulfill the requirements of my course at City will help me get over some of the difficulties I have in writing in this medium. All I can say for now is, we can but hope; at least I’ve managed to write this post!
