Using HTTrack to harvest documents for IR deposit
Presentation Topic
The University of Georgia (UGA) launched its DSpace repository – Athenaeum@UGA – in 2010, as part of an IMLS-funded Georgia Knowledge Repository grant. After several years of outreach, the number of faculty willing to self-submit remained minuscule. In an effort to find more scholarly materials, more efficiently, I initiated a project to harvest files directly from UGA websites. Using a website archiving tool called HTTrack, I downloaded over 27,000 files. I set HTTrack to harvest specific file types from every departmental domain that I could identify.
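As a rough illustration of what "specific file types from every departmental domain" could look like in practice, the sketch below assembles an HTTrack command line in Python. It is not the configuration used in the project: the function name, the example domains, the file extensions, and the output directory are all hypothetical. It assumes only HTTrack's documented `-O` option (output path) and its `+pattern` accept filters.

```python
from typing import List


def build_httrack_command(start_urls: List[str],
                          extensions: List[str],
                          output_dir: str) -> List[str]:
    """Build an argument list for an HTTrack run that accepts given file types.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if not start_urls:
        raise ValueError("at least one start URL is required")

    # Start URLs first, then -O to set the directory for the mirror and logs.
    command = ["httrack", *start_urls, "-O", output_dir]

    # A "+pattern" filter asks HTTrack to accept URLs matching the pattern.
    # Extensions are normalised so that ".PDF" and "pdf" give the same filter.
    command += [f"+*.{ext.lstrip('.').lower()}" for ext in extensions]
    return command


if __name__ == "__main__":
    # Placeholder domains and file types, not the ones harvested at UGA.
    example = build_httrack_command(
        ["https://dept-a.example.edu/", "https://dept-b.example.edu/"],
        ["pdf", "doc", "ppt"],
        "harvest",
    )
    print(" ".join(example))
```

The returned list can be passed to `subprocess.run(command, check=True)` to start the crawl. Keeping the command as a list rather than a single string avoids shell quoting problems with the `*` wildcards in the filters.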
The result was a (terrifying) mass of files with little to no identifying information. This presentation will discuss the rationale for such a harvest, the mechanics of using HTTrack, and the efforts to sort through the resultant pile for scholarly ‘gold.’ I will also discuss, tentatively, some lessons learned.
Was it worth it? Well…
Start Date and Time
5-6-2015 3:15 PM
End Date and Time
5-6-2015 4:15 PM
Recommended Citation
Carter, Andy, "Using HTTrack to harvest documents for IR deposit" (2015). Digital Commons Southeastern User Group. 1.