Tuesday, 18 August 2009

High usage of Institutional Repository content – but can we believe the statistics?

The full-text download statistics now being made available by a growing number of open access Institutional Repositories (IRs) show very high levels of usage by researchers around the world, with developing countries among the top user countries. But some have criticized these figures, fearing that the usual problems with web statistics make them overstate real usage.

Problems that can arise when analysing web statistics include accesses by web crawlers, crawlers or other automated processes becoming ‘stuck’, intranet usage that may be of purely local relevance, and individual authors occasionally downloading their own publications for distribution, teaching or other purposes. Such records of ‘usage’ do not reflect global professional interest in articles.

Since web statistics have been criticized as ‘worse than useless’, it seemed important to ask some experienced managers of established IRs what actions, if any, they take to minimize spurious downloads. Are the full-text download statistics they record reliable? Are steps taken to remove machine-generated accesses? Here are the answers they provided:

- "Yes, we “clean” our download statistics to exclude crawlers, agents and other “anomalous” situations. We do it in two complementary ways:
Automatic exclusion of downloads originated by “well behaved” crawlers and/or by crawlers already identified as such in our database of agents and crawlers;
We automatically generate an “observation list” of “suspicious” behavior from particular IP addresses. Every month we check that list, and 3 things can happen: 1 – We conclude it’s a crawler, the IP is added to the crawler database and the downloads are removed from the statistics; 2 – We conclude it’s not a crawler and the IP is removed from the suspicious list; 3 – We can’t reach a conclusion and the IP remains on the observation list for another month".
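The two-stage cleaning described in the quote above can be sketched in Python. Everything here is an illustrative assumption, not the repository's actual code: the example IP addresses, the function names, and the monthly-volume threshold used to decide what counts as "suspicious".

```python
from collections import Counter

# Example addresses only; a real repository would maintain this list over time.
KNOWN_CRAWLER_IPS = {"66.249.66.1", "157.55.39.1"}
SUSPICION_THRESHOLD = 500  # downloads per month; an assumed cut-off


def clean_downloads(log_entries, observation_list):
    """Drop downloads from known crawlers and flag suspicious IPs.

    log_entries: iterable of (ip, item_id) tuples for one month.
    observation_list: set of IPs carried over for manual review.
    Returns (cleaned_entries, updated_observation_list).
    """
    log_entries = list(log_entries)
    per_ip = Counter(ip for ip, _ in log_entries)

    # Stage 1: automatic exclusion of downloads from known crawlers.
    cleaned = [(ip, item) for ip, item in log_entries
               if ip not in KNOWN_CRAWLER_IPS]

    # Stage 2: build the next month's observation list from addresses
    # whose download volume looks anomalous.
    suspicious = {ip for ip, n in per_ip.items()
                  if n > SUSPICION_THRESHOLD and ip not in KNOWN_CRAWLER_IPS}
    return cleaned, observation_list | suspicious
```

A human reviewer would then resolve each observed IP in one of the three ways the manager describes: promote it to the known-crawler list (and discard its downloads), clear it, or leave it on the list for another month.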

- “We had to specifically exclude a lot of crawlers, as they occasionally clobbered up the eprintstats database that we use. That is the main reason that our usage stats dipped at the start of this year. I don't think that repeat downloads are a big problem - those from the local Intranet are identified anyway. I do recall that one obscure paper was top of the charts one month, which turned out to be due to a crawler getting stuck on it, so I removed its stats.”
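An anomaly like the ‘stuck crawler’ in the quote above can also be caught automatically by flagging items whose monthly count sits far outside the repository's norm. This is a minimal sketch, not any repository's actual check; the 10× median threshold is an assumption chosen for illustration.

```python
from statistics import median


def flag_outliers(monthly_counts, factor=10):
    """Return item ids downloaded more than factor x the median item count.

    monthly_counts: dict mapping item_id -> downloads this month.
    """
    if not monthly_counts:
        return []
    m = median(monthly_counts.values())
    # max(m, 1) avoids flagging everything when the median is zero.
    return [item for item, n in monthly_counts.items() if n > factor * max(m, 1)]
```

Flagged items would still need a manual look at the logs, since a genuine surge of interest (for example, after press coverage) can look similar to a stuck crawler.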

- “I think the figures for downloads are more reliable than those for abstract views. There is bound to be some crawler usage in our stats, but I think we have done what we can to minimise them and so I don't think the stats give a distorted picture. I agree that you do need to be cautious when quoting web usage stats.”

- “As a rule of thumb, about 50% of repository downloads are attributable to non-human clicking. Our IRStats makes efforts to filter these out. For a start, we maintain a list of addresses of the major, known crawlers. We also ignore sites that download "too much". And finally we discount multiple downloads of the same item from the same browser within a particular period. . . . I think that (to our best efforts) you don't need to adjust the figures at all from IRStats. What we can't tell you is whether the downloads represent "genuine scholars", "commercial researchers", "members of the public", "students" or "mistaken downloads".”
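The third filter mentioned above, discounting repeat downloads of the same item from the same browser within a particular period, can be sketched as follows. The 30-minute window, the field names, and the assumption that events arrive sorted by time are all illustrative choices, not IRStats internals.

```python
from datetime import datetime, timedelta

DOUBLE_CLICK_WINDOW = timedelta(minutes=30)  # assumed period


def dedupe_downloads(events):
    """Count each (client, item) pair at most once per window.

    events: list of (timestamp, client_id, item_id), assumed sorted by time.
    Returns only the events that count toward the usage statistics.
    """
    last_counted = {}  # (client_id, item_id) -> timestamp of last counted hit
    counted = []
    for ts, client, item in events:
        key = (client, item)
        prev = last_counted.get(key)
        if prev is None or ts - prev >= DOUBLE_CLICK_WINDOW:
            counted.append((ts, client, item))
            last_counted[key] = ts
    return counted
```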

- “The visits to the IRs by the numerous crawlers are taken into consideration by the software that [is] used for generating the statistics. [But even if] such visits [are] taken care of, a small percentage (< 5%) could still creep in as new crawlers keep emerging on a regular basis. Some software have the feature of not considering the visits/downloads from the Intranet. This has to be done at the software configuration level. In our case, I'm yet to implement this feature. In my opinion, the visit/downloads from the Intranet could be up to 10%. Both the figures given above are rough estimates based on my judgement.”

Judging from these comments from IR managers (using both EPrints and DSpace software), it can be concluded that the major ‘web-stats problems’ are taken care of and that the figures recorded reflect genuine usage as nearly as possible. However, while the statistics tables can therefore be a reliable measure of the growing volume of research information being downloaded daily, the following provisos remain:

- there will always be a small level of uncertainty in interpretation because of the fluid nature of the web;
- not all IRs may be as assiduous as those contacted here in ‘cleaning up’ their statistics reports, and newly established IRs may still be incorporating checking procedures;
- comparisons of usage recorded by different IRs may be difficult because of differences in configuration or management procedures. However, usage of a single site over different time periods remains a valid measurement of growth in usage of a site;
- as one of the IR managers stated, it is still not possible to identify the precise usage to which downloads will be put, and it may be that this can never be established, given the increasingly diverse ways in which the research community can now exchange information. However, it is common sense to assume that a user is unlikely to take the time to download technical articles (and in the case of developing countries to suffer the irritations and costs of low bandwidth) unless there is a genuine professional interest in an article as a way to enhance the user’s own research. There is also now good evidence to show that downloads lead to citations, and a recent paper from Cornell University analyses different mechanisms now available for measuring research impact.
- while usage figures supplied by IRs are a strong measure of the value of IR content to research, there is a need for standardisation that will make it possible to compare downloads from different IRs running different software. COUNTER and PIRUS are two JISC-supported projects designed to address this, with input from IRs (led by Paul Needham of the Cranfield Institute in the UK).

It is important that proponents of IRs ensure that the research communities understand that statistics of downloads from IRs are a reliable measurement, different from figures that may be quoted from informal and non-professional web sites. The limitations are well understood by IR managers, and steps are taken to remove spurious records as far as possible. The wider communities need to know that IR statistics are professionally managed and reflect reality within the limits that current technical methodology allows.

Since it is the mission of the EPT to promote the bi-directional exchange of research information, the reliability of usage statistics from IRs is essential if they are to be quoted as evidence of the value placed on these resources by the global research community. The comments provided by these managers of established IRs provide reassurance that they are reliable and can be quoted with confidence.

The following web sites provide examples of some of the IR statistics pages showing usage by country. From these it can be seen that usage by developing countries from IRs located both in the developed and emerging countries is high.

- University of Strathclyde, UK
- University of Otago, Business School, New Zealand
- University of Minho, Portugal
- Universidad de Los Andes, Venezuela
- Indian Institute of Science, Bangalore [not yet guaranteed 24/7 because of development work]

To access all registered IRs, use:

A final thought – as all working researchers know, access to any single article may be the key to a major breakthrough, so counting downloads (or even citations) can never reflect the true impact that IRs have on research – but IRs – along with open access journals – open doors that were previously closed to all but a few.

Posted by Barbara Kirsop, with many thanks to IR managers who provided details of their statistics management procedures.

1 comment:

Iryna Kuchma said...

Thank you for raising this issue, Barbara, and for providing the solutions! From our experience with the federated repository I know how tricky it is for the manager to analyse the log files and to identify the ‘technical’ visitors that have nothing to do with the content of the site.