Title
Using HTTrack to harvest documents for IR deposit
Presentation Topic
Workflows
Description
The University of Georgia (UGA) launched its DSpace repository – Athenaeum@UGA – in 2010, as part of an IMLS-funded Georgia Knowledge Repository grant. After several years of outreach, the number of faculty willing to self-submit remained minuscule. In an effort to find more scholarly materials, more efficiently, I initiated a project to harvest files directly from UGA websites. I set HTTrack, a website archiving tool, to harvest specific file types from every departmental domain that I could identify, and downloaded over 27,000 files.
The result was a (terrifying) mass of files with little to no identifying information. This presentation will discuss the rationale for such a harvest, the mechanics of using HTTrack, and the effort to sort through the resulting pile for scholarly ‘gold.’ I will also discuss, tentatively, some lessons learned.
Was it worth it? Well…
Start Date and Time
5-6-2015 3:15 PM
End Date and Time
5-6-2015 4:15 PM
Recommended Citation
Carter, Andy, "Using HTTrack to harvest documents for IR deposit" (2015). Digital Commons Southeastern User Group. 1.
https://digitalcommons.winthrop.edu/dcseug/2015/schedule/1