Title

Using HTTrack to harvest documents for IR deposit

Presentation Topic

Workflows

Description

The University of Georgia (UGA) launched its DSpace repository – Athenaeum@UGA – in 2010, as part of an IMLS-funded Georgia Knowledge Repository grant. After several years of outreach, the number of faculty willing to self-submit remained minuscule. To find more scholarly materials more efficiently, I initiated a project to harvest files directly from UGA websites. Using the website archiving tool HTTrack, I harvested specific file types from every departmental domain that I could identify, downloading over 27,000 files.

The result was a (terrifying) mass of files with little to no identifying information. This presentation will discuss the rationale for such a harvest, the mechanics of using HTTrack, and the efforts to sort through the resultant pile for scholarly ‘gold.’ I will also discuss, tentatively, some lessons learned.

Was it worth it? Well…

Start Date and Time

5-6-2015 3:15 PM

End Date and Time

5-6-2015 4:15 PM
