TY - JOUR TI - Web Archiving: Issues and Problems in Collection Building and Access AU - Jacobsen, Grethe T2 - Liber Quarterly: The Journal of European Research Libraries AB - Denmark began web archiving in 2005 and the experiences are presented with a specific focus on collection-building and issues concerning access. In creating principles for what internet materials to collect for a national collection, one can in many ways build on existing practice and guidelines. The actual collection requires strategies for harvesting relevant segments of the internet in order to assure as complete a coverage as possible. Rethinking is also necessary when it comes to the issue of description, but cataloguing expertise can be utilised to find new ways for users to retrieve information. Technical problems in harvesting and archiving are identifiable and can be solved through international cooperation. Access to the archived materials, on the other hand, has become the major challenge to national libraries. Legal obstacles prevent national libraries from offering general access to their archived internet materials. In Europe the principal obstacles are the EU Directive on Data Protection (Directive 95/46/EC) and local data protection legislation based on this directive. LIBER is urged to take political action on this issue in order that the general public may have the same access to the collection of internet materials as it has to other national collections. Adapted from the source document. 
DA - 2008/// PY - 2008 VL - 18 IS - 3-4 SN - 1435-5205 UR - http://liber.library.uu.nl/ KW - Collection development KW - Digital archives KW - Access to materials KW - Denmark KW - Internet archiving KW - Web archiving ER - TY - JOUR TI - Researching Communicative Practice: Web Archiving in Qualitative Social Media Research AU - Lomborg, Stine T2 - Journal of Technology in Human Services AB - This article discusses the method of web archiving in qualitative social media research. While presenting a number of methodological challenges, social media archives (i.e., complete recordings of posts and comments on given social media) are also highly useful data corpora for studying social media users' communicative practices. Through a theoretical examination of web archiving as a new method enabled by the web itself, and an example-based discussion of the methodological, technical, and ethical challenges of harvesting social media archives, the article discusses the merits and limitations of using social media archives in empirical social media research. Adapted from the source document. 
DA - 2012/07// PY - 2012 DO - 10.1080/15228835.2012.744719 VL - 30 IS - 3-4 SP - 219 EP - 231 LA - English SN - 1522-8835 UR - https://search.proquest.com/docview/1550991769?accountid=27464 L4 - http://www.tandfonline.com/doi/abs/10.1080/15228835.2012.744719 KW - Web archiving KW - Research methods KW - social media KW - Social networks KW - article KW - 1.13: LIS - RESEARCH KW - audience studies KW - communicative practices KW - qualitative methods ER - TY - JOUR TI - Ensuring Long-Term Access to the Memory of the Web: Preservation Working Group of the International Internet Preservation Consortium AU - Oury, Clément AU - Steinke, Tobias AU - Jones, Gina T2 - International Preservation News AB - Archiving the Web is the process through which documents and objects on the World Wide Web are captured and stored. There are and have been a number of ways through which this has been accomplished, but the end result is archived Web content (Web site, page, or part of a Website) that is preserved for future researchers, historians and the general public. Preservation involves maintaining the ability to present meaningful access to information over time. In the context of Web archives, the intention of preservation is to retain access to archived Web resources, so they can continue to be used and understood despite changes in access technologies or without unacceptable loss of integrity or meaning. The International Internet Preservation Consortium, chartered in 2003, is made up of institutions with broadly similar goals of preserving Web content for heritage purposes and which generally share the same harvesting and access tools. 
DA - 2012/12// PY - 2012 IS - 58 SP - 34 EP - 37 LA - English UR - https://search.proquest.com/docview/1272325401?accountid=27464 KW - World Wide Web KW - Library And Information Sciences KW - Archives & records KW - Research KW - Web sites KW - Migration KW - Preservation KW - Data bases KW - Public access ER - TY - JOUR TI - Profiling web archive coverage for top-level domain and content language AU - AlSum, Ahmed AU - Weigle, Michele C AU - Nelson, Michael L AU - Van de Sompel, Herbert T2 - International Journal on Digital Libraries AB - The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) by only sending queries to archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives' access logs, and fulltext queries to archives) and use these profiles as resource descriptors. These profiles are used in matching the URI-lookup requests to the most probable web archives. We define [formula omitted] as the percentage of a TimeMap that was returned using [formula omitted] web archives. We discover that only sending queries to the top three web archives (i.e., an 80% reduction in the number of queries) for any request reaches on average [value omitted]. If we exclude the Internet Archive from the list, we can reach [value omitted] 
on average using only the remaining top three web archives. DA - 2014/08/27/ PY - 2014 DO - 10.1007/s00799-014-0118-y VL - 14 IS - 3-4 SP - 149 EP - 166 LA - English SN - 1432-5012 UR - https://search.proquest.com/docview/1547814948?accountid=27464 L4 - https://arxiv.org/pdf/1309.4008 L4 - http://link.springer.com/article/10.1007/s00799-014-0118-y L4 - http://link.springer.com/10.1007/s00799-014-0118-y KW - Information science KW - Digital libraries KW - Library And Information Sciences--Computer Applica KW - Archives & records KW - Library science ER - TY - JOUR TI - Webarchiving: Legal Deposit of Internet in Denmark. A Curatorial Perspective AU - Schostag, Sabine AU - Fonss-Jorgensen, Eva T2 - Microform & Digitization Review AB - Since 2005 archiving the dynamic Internet has been required by law in Denmark. This article tells the story of the last seven years of experience with archiving the Internet in Denmark: What is covered by the law? How do we organize the work? How do we collect the web in practice? Who has access to the web archive? And finally, what are the challenges and future perspectives? The article focuses on the curatorial aspects and does not go into technical details. Adapted from the source document. 
DA - 2012/12// PY - 2012 VL - 41 IS - 3-4 SP - 110 EP - 120 LA - English SN - 2190-0752 UR - https://search.proquest.com/docview/1520327503?accountid=27464 L4 - http://www.degruyter.com/view/j/mfir KW - Web archiving KW - Digital curation KW - Legal deposit KW - article KW - 3.2: ARCHIVES KW - Denmark KW - Electronic media ER - TY - JOUR TI - Access and Scholarly Use of Web Archives AU - Hockx-Yu, Helen T2 - Alexandria DA - 2014/// PY - 2014 VL - 25 IS - 1/2 SP - 113 EP - 127 LA - English SN - 0955-7490 UR - https://search.proquest.com/docview/1623365740?accountid=27464 L4 - http://journals.sagepub.com/doi/pdf/10.7227/ALX.0023 KW - Library And Information Sciences ER - TY - JOUR TI - Nationale Grenzen im World Wide Web – Erfahrungen bei der Webarchivierung in der Österreichischen Nationalbibliothek AU - Mayr, Michaela AU - Predikaka, Andreas T2 - Bibliothek Forschung und Praxis AB - Since 2009, the Austrian National Library has performed four broad crawls, based on the Austrian Media Act, focused primarily on the top-level domain .at. The analysis of these crawls indicates that national borders play an important role in collection methods for cultural heritage on the World Wide Web. 
DA - 2016/01/01/ PY - 2016 DO - 10.1515/bfp-2016-0007 VL - 40 IS - 1 SP - 90 EP - 95 LA - English SN - 1865-7648 UR - https://search.proquest.com/docview/1780113609?accountid=27464 L4 - https://www.degruyter.com/view/j/bfup.2016.40.issue-1/bfp-2016-0007/bfp-2016-0007.xml KW - Web archiving KW - World Wide Web KW - Library And Information Sciences KW - National libraries KW - 3.11:NATIONAL LIBRARIES AND STATE LIBRARIES KW - 14.11:COMMUNICATIONS AND INFORMATION TECHNOLOGY - KW - Webarchivierung KW - Austria KW - Broad Crawl KW - Domain Crawl KW - Österreichische Nationalbibliothek ER - TY - JOUR TI - Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives AU - Brunelle, Justin F AU - Ferrante, Krista AU - Wilczek, Eliot AU - Weigle, Michele C AU - Nelson, Michael L T2 - D-Lib Magazine AB - In this work, we present a case study in which we investigate using open-source, web-scale web archiving tools (i.e., Heritrix and the Wayback Machine installed on the MITRE Intranet) to automatically archive a corporate Intranet. We use this case study to outline the challenges of Intranet web archiving, identify situations in which the open source tools are not well suited for the needs of the corporate archivists, and make recommendations for future corporate archivists wishing to use such tools. We performed a crawl of 143,268 URIs (125 GB and 25 hours) to demonstrate that the crawlers are easy to set up, efficiently crawl the Intranet, and improve archive management. However, challenges exist when the Intranet contains sensitive information, areas with potential archival value require user credentials, or archival targets make extensive use of internally developed and customized web services. We elaborate on and recommend approaches for overcoming these challenges. 
DA - 2016/01// PY - 2016 DO - 10.1045/january2016-brunelle VL - 22 IS - 1/2 SP - 1 LA - English SN - 1082-9873 UR - https://search.proquest.com/docview/1806649179?accountid=27464 L4 - http://digitalcommons.odu.edu/computerscience_fac_pubs/10/?utm_source=digitalcommons.odu.edu%2Fcomputerscience_fac_pubs%2F10&utm_medium=PDF&utm_campaign=PDFCoverPages L4 - http://www.dlib.org/dlib/ KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Archives & records KW - 3.2:ARCHIVES KW - Archivists KW - Case studies KW - Open source software ER - TY - JOUR TI - A semantic architecture for preserving and interpreting the information contained in Irish historical vital records AU - Debruyne, Christophe AU - Beyan, Oya Deniz AU - Grant, Rebecca AU - Collins, Sandra AU - Decker, Stefan AU - Harrower, Natalie T2 - International Journal on Digital Libraries DA - 2016/09/01/ PY - 2016 DO - 10.1007/s00799-016-0180-8 VL - 17 IS - 3 SP - 159 EP - 174 SN - 1432-5012 UR - http://link.springer.com/10.1007/s00799-016-0180-8 ER - TY - JOUR TI - Collecting and preserving the Ukraine conflict (2014-2015): a web archive at University of California, Berkeley AU - Pendse, Liladhar R T2 - Collection Building AB - Purpose The purpose of this paper is to highlight the web-archiving as a tool for possible collection development in a research level academic library. The paper highlights the web-archiving project that dealt with the contemporary Ukraine conflict. Currently, as the conflict in Ukraine drags on, the need for collecting and preserving the information from various web-based resources with different ideological orientations acquires a special importance. The demise of the Soviet Union in 1991 and the emergence of independent republics were heralded by some as a peaceful transition to the "free-market" style economies. This transition was nevertheless nuanced and not seamless. 
Besides incomplete market liberalization and rent-seeking behaviors of various sorts, the transition was also accompanied by the almost ubiquitous use of and access to the internet and internet communication technologies. Now, 24 years later, the ongoing conflict in Ukraine also appears to be unfolding on the World Wide Web. With the Russian annexation of Crimea and its incorporation into the Russian Federation, the governmental and non-governmental websites of Ukrainian Crimea suddenly came to represent a sort of "endangered archive". Design/methodology/approach The main purpose of this project was to make the information contained in Ukrainian and Russian websites available to a wider body of scholars and students over a longer period of time in a web archive. The author does not take any ideological stance on the legal status of Crimea or on the ongoing conflict in Ukraine. There are currently several projects devoted to the preservation of these websites. This article also surveys the landscape of these projects and highlights the ongoing web-archiving project entitled "The Ukraine Crisis: 2014-2015" at the UC Berkeley Library. Findings UC Berkeley's Ukraine Conflict Archive was made available to the public in March 2015, after enough materials had been archived. The initial purpose of the archive was to selectively harvest and archive those websites that were bound either to disappear or to change significantly during Crimea's accession to Russia. However, in the aftermath of the Crimean conflict, the ensuing military conflict in Ukraine forced the author to reevaluate the web-archiving strategy. The project was never envisioned as a competitor to the Ukraine Conflict project. Instead, it was intended to capture complementary data that might have been missed by other similar projects. 
This web archive has been made public to provide a glimpse of what was happening and what is happening in Ukraine. Research limitations/implications Now, 24 years later, the ongoing conflict in Ukraine also appears to be unfolding on the World Wide Web. With the Russian annexation of Crimea and its incorporation into the Russian Federation, the governmental and non-governmental websites of Ukrainian Crimea suddenly came to represent a sort of "endangered archive". The impetus for archiving the selected Ukrainian websites came as a result of the changing geopolitical realities of Crimea. The daily changes to the websites, and the loss of the information contained within them, are among the many problems faced by users of these websites. In some cases, the likelihood of these websites disappearing is relatively high. This in turn was followed by the author's desire to preserve information about daily life in Ukraine's east in light of the unfolding violent armed conflict. Originality/value A close survey of currently published library and information science articles on the Ukraine conflict found none dedicated to archiving the Crimean and Ukrainian situations. DA - 2016/// PY - 2016 VL - 35 IS - 3 SP - 64 EP - 72 LA - English SN - 0160-4953 UR - https://search.proquest.com/docview/1829452180?accountid=27464 KW - Web archiving KW - Digital archives KW - Library And Information Sciences KW - Academic libraries KW - Web sites KW - Social networks KW - Annual reports KW - Crimea KW - Institutional repositories KW - Library and information science KW - Russia KW - Ukraine ER - TY - JOUR TI - Online British Official Publications from the University of Southampton AU - Caisley, Joy AU - Ball, Julian AU - Phillips, Matthew T2 - Refer AB - The Library at the University of Southampton has a particularly strong collection of printed British Official Publications, known as the Ford Collection. 
The collection is named after the late Professor Percy Ford and his wife Dr Grace Ford, who brought the collection to the University of Southampton in the 1950s from the Carlton Club and conducted research based on the collection. Hoping to increase both the appreciation and the use of official publications, the Fords compiled breviates, or select lists, in seven volumes covering the years 1833-1983. These were not catalogues of all British Official Publications. Instead the Fords identified and summarised documents which have been, or might have been, the subject of legislation or have dealt with public policy. Although funding was provided for specific tasks and periods, the Library continues to work unfunded with these valuable digital collections in 2016 to ensure that they are made fully accessible for readers worldwide. DA - 2016/// PY - 2016 VL - 32 IS - 2 SP - 27 EP - 32 LA - English SN - 0144-2384 UR - https://search.proquest.com/docview/1803448852?accountid=27464 KW - Collaboration KW - Web archiving KW - Library And Information Sciences KW - Academic libraries KW - Internet KW - Library collections KW - Metadata KW - Publications KW - 18th century KW - 20th century KW - Bibliographic records KW - Colleges & universities KW - Current awareness services KW - Funding ER - TY - JOUR TI - Featured Web Resource: Theological Commons AU - Murray, Gregory P T2 - Theological Librarianship AB - In late 2010, Dr Iain Torrance, at that time the President of Princeton Theological Seminary, asked a small subset of library staff to consider how to improve discoverability and access to the thousands of volumes on theology and religion that Princeton Seminary and other institutions had digitized through the Internet Archive, to facilitate research by students, scholars, and pastors both locally and globally. 
However, because the goal was to provide access to relevant resources, not to showcase Princeton's digital content, the digital library team subsequently took a detailed list of Library of Congress subject headings provided by Don Vorp, at that time Collection Development Librarian at Princeton Seminary, and performed searches in the Internet Archive system for digitized books with those subjects, irrespective of library of origin. Those items were then harvested in the same manner. This procedure soon amassed tens of thousands of digital texts, and in March 2012, the Theological Commons was publicly released as a free, web-accessible digital library. DA - 2016/10// PY - 2016 VL - 9 IS - 2 SP - 1 LA - English SN - 1937-8904 UR - https://search.proquest.com/docview/1842842888?accountid=27464 L4 - https://theolib.atla.com/theolib/article/view/434/1513 KW - Web archiving KW - Collection development KW - Digital libraries KW - Digitization KW - Archives & records KW - Internet resources KW - Access to materials KW - Princeton New Jersey KW - Religions And Theology KW - Theological schools ER - TY - JOUR TI - What's trending in libraries from the internet cybersphere - bookless libraries - 02 - 2016 AU - Oyelude, Adetoun A T2 - Library Hi Tech News AB - Purpose Sean Follmer with his colleagues, Daniel Leithinger and Hiroshi Ishii have created inFORM, where the computer interface can actually come off the screen and one can physically manipulate it. Design/methodology/approach One can visualize 3D information physically and touch it and feel it to understand it in new ways. Findings The interface also allows one to interact through gestures and direct deformations to sculpt digital clay, and interface elements can arise out of the surface and change on demand. Their idea is that for each individual application, the physical form can be matched to the application. 
Urban planners and architects can use it to explore their designs in detail; using inFORM, one can reach out from the screen to manipulate things at a distance, and also collaborate on 3D sets, gesturing around them. Originality/value It allows people to collaborate in ways hitherto not possible. Posted on February 10, 2016, the TED talk had received over one million views as of June 9, 2016. It is trending! The researchers are thinking of "new ways that we can bring people together, and bring our information into the world, and think about smart environments that can adapt to us physically". DA - 2016/// PY - 2016 VL - 33 IS - 6 SP - 19 EP - 20 LA - English SN - 0741-9058 UR - https://search.proquest.com/docview/1823127353?accountid=27464 KW - Collaboration KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Digital archives KW - Library collections KW - National libraries KW - Web sites KW - E-books KW - Photographs KW - Trends KW - Internet of Things KW - Novels KW - Weblogs ER - TY - JOUR TI - Where Did All the Information Go? Well at Least the Important Stuff AU - Johnson, Paul T2 - Refer AB - There is a lot of current concern about the sheer number of web pages and digital documents being lost forever. In this definition, lost implies destroyed. A report in 2011 by the Chesapeake Digital Preservation Group suggested that approximately 30% of a control group of 2,700 online law-related materials disappeared within 3 years. Librarians have been highlighting this concern for many years and must take a lot of the credit for the introduction of so many successful web archiving initiatives -- including the Non-Print Legal Deposit legislation enacted in the UK in 2013. However, for this article the author wants to focus on a different kind of lost information, where the definition of lost refers to information that cannot be found. 
In December 2015 he gave a presentation at a Koha library systems event, which explored the changing environment of discovery services within the academic market. Clicking on the "repeat the search" link runs the search again with no apparent limit on the number of results returned. DA - 2016/// PY - 2016 VL - 32 IS - 2 SP - 8 EP - 12 LA - English SN - 0144-2384 UR - https://search.proquest.com/docview/1803538845?accountid=27464 KW - Web archiving KW - Digital preservation KW - Library And Information Sciences KW - United Kingdom--UK KW - Social networks KW - Libraries KW - Legal deposit KW - Librarians KW - Information retrieval KW - Bias KW - Open access KW - Prejudice ER - TY - JOUR TI - Avoiding Courseware With Slack AU - West, Jessamyn T2 - Computers in Libraries AB - Slack is a cloud-based software tool for team collaboration. The author used it as the primary tool to teach an asynchronous graduate level course called Tools for Community Advocacy at the University of Hawaii's library and information science (UHLIS) program, and it went well. UHLIS uses courseware that is some of the best out there -- Laulima, based on Sakai -- but, like all courseware, it has a steep learning curve and some limitations. As an adjunct who was teaching a single 6-week class, she didn't have the time available to learn to use the tool well. She decided to stick with what she knew -- which was Web sites, Google Docs, Skype, and Slack -- using Slack as the activity hub. Slack's pricing model is also attractive, which is why she mentions it as a real option for libraries. 
DA - 2016/10// PY - 2016 VL - 36 IS - 8 SP - 14 EP - 15 LA - English SN - 1041-7915 UR - https://search.proquest.com/docview/1830247744?accountid=27464 KW - Collaboration KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Archives & records KW - Internet KW - Social networks KW - Libraries KW - Library and information science KW - Chat rooms KW - Educational software KW - Students ER - TY - JOUR TI - Web archive profiling through CDX summarization AU - Alam, Sawood AU - Nelson, Michael L AU - Van de Sompel, Herbert AU - Balakireva, Lyudmila L AU - Shankar, Harihar AU - Rosenthal, David S. H. T2 - International Journal on Digital Libraries AB - With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings as well as to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should only poll the archives that are likely to have a copy of the requested URI. Using the crawler index files produced after crawling, we can generate profiles of the archives that summarize their holdings and can be used to inform routing of the Memento aggregator's URI requests. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. In our experiments, we correctly identified about 78% of the URIs that were present or not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of URIs with less than 10% relative cost without any false negatives. With respect to the TLD-only profile, the registered domain profile doubled the routing precision, while complete hostname and one path segment gave a tenfold increase in the routing precision. 
DA - 2016/09// PY - 2016 DO - 10.1007/s00799-016-0184-4 VL - 17 IS - 3 SP - 223 EP - 238 LA - English SN - 1432-5012 UR - https://search.proquest.com/docview/1811726775?accountid=27464 KW - Web archiving KW - Web archives KW - Memento KW - Library And Information Sciences--Computer Applica KW - Profiling KW - CDX files KW - Protocol KW - Queries KW - Query routing KW - Routing ER - TY - JOUR TI - Libraries and digital memory AU - Massis, Bruce T2 - New Library World AB - Purpose The purpose of this column is to consider the role of libraries in an effort to preserve and protect a collective digital memory. Design/methodology/approach This paper offers a literature review of and commentary on a topic that has been addressed by professionals, researchers and practitioners. Findings Libraries and library consortia will go forward into the future and expand as trusted repositories where digital memory can be preserved and shared. Originality/value The value in exploring this topic is to examine the library environment for collection, storage and dissemination of digital information. 
DA - 2016/// PY - 2016 VL - 117 IS - 9/10 SP - 673 EP - 676 LA - English SN - 0307-4803 UR - https://search.proquest.com/docview/1830312026?accountid=27464 KW - Web archiving KW - Digital archives KW - Library And Information Sciences KW - Academic libraries KW - Digitization KW - Books KW - Internet KW - Library collections KW - National libraries KW - Social networks KW - Consortia KW - Museums KW - Funding KW - Industrialized nations KW - Oral tradition ER - TY - JOUR TI - Silencing Marginalized Voices: The Fragmentation of the Official Record AU - Garnar, Martin T2 - Reference & User Services Quarterly AB - When researching historical topics, government statistics are often viewed as the most reliable source of information, lending credibility to researchers' arguments by providing documentary evidence of how society is changing. In investigating issues related to equity, diversity, and inclusion, these statistics serve as benchmarks for the progress (or lack thereof) on how historic injustices are being addressed. Therefore, it is imperative that the information be reliable, verifiable, and available. In this case, the Internet Archive may have the missing pages on its website, but there's no guarantee that the desired information was captured, whether because pages were missed or snapshots missed important updates. There is also no guarantee that this nonprofit, nongovernmental website will continue to be available in the future. Without reliable access to government information, researchers will not be able to document what was available on governmental websites, and an important source of public policy data will be lost to future researchers. 
DA - 2018/// PY - 2018 VL - 57 IS - 3 SP - 193 EP - 195 UR - https://search.proquest.com/libraryinformation/docview/2016963494/abstract/D0A3C983EC43401FPQ/2?accountid=27464 Y2 - 2018/09/03/ ER - TY - JOUR TI - Building and querying semantic layers for web archives (extended version) AU - Fafalios, Pavlos AU - Holzmann, Helge AU - Kasturia, Vaibhav AU - Nejdl, Wolfgang T2 - International Journal on Digital Libraries AB - Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles (layers) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata information about the archived documents, annotating them with useful semantic information (like entities, concepts and events), and publishing all this data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems are not able to sufficiently satisfy. 
DA - 2018/// PY - 2018 DO - 10.1007/s00799-018-0251-0 VL - 19 IS - 1 SP - 1 EP - 19 SN - 1432-1300 KW - Web archives KW - Exploratory search KW - Linked data KW - Profiling KW - Semantic layer ER - TY - CONF TI - Memory Entanglements and Collection Development in a Transnational Media Landscape AU - Häusner, Eva Maria AB - Defining a national domain is the crux of every National Library's mission. The National Library of Sweden collects, preserves, registers, and guarantees access to all materials published and distributed in Sweden: printed, audio-visual and, since 2012, even electronic. Furthermore, the National Library of Sweden collects Suecana, foreign publications which possess historical significance to Sweden, and even Swedish literature in translation. Collection strategies have to be updated and developed to fit the times: digitalization and media convergence presuppose a new concept and new definition of the national domain. How should the National Library work with selection and collection strategies in a way that ensures that the Suecana collection and the Swedish collection are truly representative and relevant? This paper describes difficulties inherent in defining a national domain in today's media landscape. C3 - IFLA 2017 DA - 2017/// PY - 2017 SP - 5 PB - IFLA L1 - http://library.ifla.org/1683/1/186-haeusner-en.pdf KW - National libraries KW - collection development KW - digitalization KW - international collaboration KW - Sweden ER - TY - CONF TI - Constituer un réseau d’accès aux archives de l’internet : l’exemple français AU - Aniesa, Ange AU - Bouchard, Ariane AB - Since 2006, the BnF has had the mission of collecting the French internet under legal deposit. To fulfil this mission as well as possible, it has progressively put in place a complete archiving system and has thereby collected billions of web pages. 
On the basis of the implementing decree of the DADVSI law, the BnF has sought to make its internet archive collections, originally consultable only in its Research reading rooms, accessible in other institutions across the regions. This article presents the different stages of opening up this access: the accreditation of the printer legal deposit libraries; the organizational and technical problems encountered and the solutions adopted; and the issues at the current stage of the project, with sixteen institutions already equipped with a service providing access to the internet archives. C3 - IFLA Congress 2017, Wroclaw, Poland DA - 2017/// PY - 2017 UR - http://library.ifla.org/1655/ Y2 - 2017/06/26/ ER - TY - JOUR TI - Doing Web history with the Internet Archive: screencast documentaries AU - Rogers, Richard T2 - Internet Histories AB - This short article explores the challenges involved in demonstrating the value of web archives, and the histories that they embody, beyond media and Internet studies. Given the difficulties of working with such complex archival material, how can researchers in the humanities and social sciences more generally be persuaded to integrate Internet histories into their research? How can institutions and organisations be sufficiently convinced of the worth of their own online histories to take steps to preserve them? And how can value be demonstrated to the wider general public? It touches on public attitudes to personal and institutional Internet histories, barriers to access to web archives (technical, legal and methodological) and the cultural factors within academia that have hindered the penetration of new ways of working with new kinds of primary source. Rather than providing answers, this article is intended to provoke discussion and dialogue between the communities for whom Internet histories can and should be of significance. 
DA - 2017/01/02/ PY - 2017 DO - 10.1080/24701475.2017.1307542 VL - 1 IS - 1-2 SP - 160 EP - 172 SN - 2470-1475 UR - http://www.tandfonline.com/doi/abs/10.1080/24701475.2017.1307542 ER - TY - JOUR TI - Out from the PLATO cave: uncovering the pre-Internet history of social computing AU - Jones, Steve AU - Latzko-Toth, Guillaume T2 - Internet Histories DA - 2017/01/02/ PY - 2017 DO - 10.1080/24701475.2017.1307544 VL - 1 IS - 1-2 SP - 60 EP - 69 SN - 2470-1475 UR - http://www.tandfonline.com/doi/abs/10.1080/24701475.2017.1307544 ER - TY - JOUR TI - Introduction: Internet histories AU - Brügger, Niels AU - Goggin, Gerard AU - Milligan, Ian AU - Schafer, Valérie T2 - Internet Histories AB - The ways in which historians define the Internet profoundly shape the histories we write. Many studies implicitly define the Internet in material terms, as a particular set of hardware and software, and consequently tend to frame the development of the Internet as the spread of these technologies from the United States. This essay explores implications of defining the Internet alternatively in terms of technology, use and local experience. While there is not a single “correct” definition, historians should be aware of the politics of the definitions they use. DA - 2017/01/02/ PY - 2017 DO - 10.1080/24701475.2017.1317128 VL - 1 IS - 1-2 SP - 1 EP - 7 SN - 2470-1475 UR - http://www.tandfonline.com/doi/abs/10.1080/24701475.2017.1317128 ER - TY - JOUR TI - A common language AU - Weber, Marc T2 - Internet Histories AB - What would a cultural history of the Internet look like? The question almost makes no sense: the Internet spans the globe and traverses any number of completely distinct human groups. It simply cannot have a single culture. And yet, like the railroad, the telegraph and the highway system before it, the Internet has been an extraordinary agent for cultural change. How should we study that process? 
To begin to answer that question, this essay returns to four canonical studies of earlier technologies and cultures: Carolyn Marvin's When Old Technologies Were New; Leo Marx's The Machine in the Garden; Ruth Schwartz Cowan's More Work for Mother and Lynn Spigel's Make Room for TV. In each case, the essay mines the earlier works for research tactics and uses them as jumping-off points to explore the ways in which the Internet requires new and different approaches. It concludes by speculating on the ways that the American-centric nature of much earlier work will need to be replaced with a newly global focus and research tactics to match. DA - 2017/01/02/ PY - 2017 DO - 10.1080/24701475.2017.1317118 VL - 1 IS - 1-2 SP - 26 EP - 38 SN - 2470-1475 UR - http://www.tandfonline.com/doi/abs/10.1080/24701475.2017.1317118 ER - TY - JOUR TI - Breaking in to the mainstream: demonstrating the value of internet (and web) histories AU - Winters, Jane T2 - Internet Histories DA - 2017/01/02/ PY - 2017 DO - 10.1080/24701475.2017.1305713 VL - 1 IS - 1-2 SP - 173 EP - 179 SN - 2470-1475 UR - http://www.tandfonline.com/doi/abs/10.1080/24701475.2017.1305713 ER - TY - JOUR TI - What and where is the Internet? (Re)defining Internet histories AU - Abbate, Janet T2 - Internet Histories AB - Both the Internet and the Web beat out numerous rivals to become today's dominant network and online system, respectively. ("Online system" is used in this essay as a generic term for Web-like systems, i.e. systems for navigating information over networks; the term originates in the 1960s oNLine System (NLS). "Online world" is used as a generic term for all of cyberspace.) Many of those rival systems and networks had developed alternative solutions to issues that face us today, from micropayments to copyright. But few scholars, much less thought leaders, have a meaningful overview of the origins of our online world, or of the many systems which came before. 
This exclusivity is a problem, since as a society we are now making some of the permanent decisions that will determine how we deal with information for decades and even centuries to come. Those decisions are about regulatory structures, economic models, civil liberties, publishing and more. This essay argues for the need to study online information systems comparatively across all these axes, and thus to develop a "common language" of known precedents and concepts as a prerequisite for informed discussions about the future of the online world. Doing so depends on two factors: (1) preservation of enough historical materials about earlier systems to be able to meaningfully examine them; (2) interdisciplinary, international attention to "meta" stories that emerge from considering the evolution of multiple networks and online systems. DA - 2017/01/02/ PY - 2017 DO - 10.1080/24701475.2017.1305836 VL - 1 IS - 1-2 SP - 8 EP - 14 SN - 2470-1475 UR - http://www.tandfonline.com/doi/abs/10.1080/24701475.2017.1305836 ER - TY - JOUR TI - Towards an Ethical Framework for Publishing Twitter Data in Social Research: Taking into Account Users’ Views, Online Context and Algorithmic Estimation AU - Williams, Matthew L AU - Burnap, Pete AU - Sloan, Luke T2 - Sociology AB - New and emerging forms of data, including posts harvested from social media sites such as Twitter, have become part of the sociologist’s data diet. In particular, some researchers see an advantage in the perceived ‘public’ nature of Twitter posts, representing them in publications without seeking informed consent. While such practice may not be at odds with Twitter’s terms of service, we argue there is a need to interpret these through the lens of social science research methods that imply a more reflexive ethical approach than provided in ‘legal’ accounts of the permissible use of these data in research publications. 
To challenge some existing practice in Twitter-based research, this article brings to the fore: (1) views of Twitter users through analysis of online survey data; (2) the effect of context collapse and online disinhibition on the behaviours of users; and (3) the publication of identifiable sensitive classifications derived from algorithms. DA - 2017/12/26/ PY - 2017 DO - 10.1177/0038038517708140 VL - 51 IS - 6 SP - 1149 EP - 1168 SN - 0038-0385 UR - http://journals.sagepub.com/doi/10.1177/0038038517708140 KW - Twitter KW - social media KW - algorithms KW - computational social science KW - context collapse KW - ethics KW - social data science ER - TY - JOUR TI - Can we write a cultural history of the Internet? If so, how? AU - Turner, Fred T2 - Internet Histories DA - 2017/01/02/ PY - 2017 DO - 10.1080/24701475.2017.1307540 VL - 1 IS - 1-2 SP - 39 EP - 46 SN - 2470-1475 UR - http://www.tandfonline.com/doi/abs/10.1080/24701475.2017.1307540 ER - TY - JOUR TI - Internet histories: the view from the design process AU - Braman, Sandra T2 - Internet Histories AB - The electrical engineers and computer scientists who have designed the Internet are among those who have written Internet history. They have done so within the technical document series created to provide a medium for and record of the design process, the Internet Requests for Comments (RFCs), as well as in other venues. Internet designers have written the network's history explicitly, in documents devoted to history, as well as indirectly, in documents focused on technical matters. The Internet RFCs also provide data for research on Internet history and on large-scale sociotechnical infrastructure written by outsiders to the design process. 
Incorporating the history of the Internet as understood by those responsible for its design, whether in their own words or by treating the design conversation as data, makes visible some elements of that history not otherwise available, corrects misperceptions of factors underlying some of its features, and provides fascinating details on the people and events involved that are of interest to those seeking to understand the Internet. Within the RFCs, history has served both technical and social functions. DA - 2017/01/02/ PY - 2017 DO - 10.1080/24701475.2017.1305716 VL - 1 IS - 1-2 SP - 70 EP - 78 SN - 2470-1475 UR - http://www.tandfonline.com/doi/abs/10.1080/24701475.2017.1305716 ER - TY - CONF TI - Getting Started in Web Archiving AU - Grotke, Abigail AB - The purpose of this paper is to provide general information about how organizations can get started in web archiving, for both those who are developing new web archiving programs and for libraries that are just beginning to explore the possibilities. The paper includes an overview of considerations when establishing a web archiving program, including typical approaches that national libraries take when preserving the web. These include: collection development, legal issues, tools and approaches, staffing, and whether to do work in-house or outsource some or most of the work. The paper will introduce the International Internet Preservation Consortium and the benefits of collaboration when building web archives. C3 - IFLA Congress 2017, Wroclaw, Poland DA - 2017/// PY - 2017 UR - http://library.ifla.org/1637/ Y2 - 2017/06/26/ ER - TY - CONF TI - The role of Internet Wayback Machine in a multi-method research project AU - Locatelli, Elisabetta AB - If, on the one hand, the web offers us a platform where content is searchable and replicable, on the other, it cannot be forgotten that web content is perishable, unstable and subject to continuous change. 
This is a challenge for scholarly research about the historical development of the web. The research presented here analyzed the historical development of weblogs in Italy, investigating their technological, cultural, economic, and institutional dimensions. The approach chosen mixed participant observation, in-depth interviews, and semiotic analysis of blogs and blog posts. Since an important part of the research was about the development of platforms, graphics, layouts, and technology, besides interviews, older versions of blogs were retrieved using the Internet Wayback Machine. Even if only partial versions of the blogs were archived, this part of the research was important to complement the data obtained from interviews and blog analysis, since individual memory is not always accurate, and some blogs had in the meantime been closed so their original posts were no longer accessible. C1 - London C3 - “Researchers, practitioners and their use of the archived web”, London, School of Advanced Study, University of London DA - 2017/// PY - 2017 L4 - https://archivedweb.blogs.sas.ac.uk/files/2017/06/RESAW2017-BruggerLocatelliWeberNanni-Web25.pdf ER - TY - BOOK TI - Using web archives in research – an introduction AU - Nielsen, Janne AB - This book has been written in connection with the development of NetLab’s workshops on web archiving for researchers. These workshops provide the participants with an introduction to working with archived web materials in research, including a description of what web archiving is, the challenges of using archived web materials as an object of research, knowledge of existing web archives, and tools for micro archiving, so that researchers can themselves archive web materials. The purpose of this book is to gather and make available knowledge about the use of web archives for research. It is written in a Danish context and adapted to the needs of Danish researchers but can also be useful for other researchers. CY - Aarhus DA - 2016/// PY - 2016 ET - 1. 
SP - 55 PB - Netlab SN - 978-87-93533-00-4 L4 - http://www.netlab.dk/wp-content/uploads/2016/10/Nielsen_Using_Web_Archives_in_Research.pdf ER - TY - CONF TI - Data Management of Web Archive Research Data AU - Jurik, Bolette AU - Zierau, Eld AB - This paper will provide recommendations to overcome various challenges for data management of web materials. The recommendations are based on results from two independent Danish research projects with different requirements for data management: The first project focuses on high precision on a par with traditional references for analogue material and with web materials found in different web archives. The second project focuses on large corpora (collections) of archived web references as a basis for analysis. C3 - “Researchers, practitioners and their use of the archived web”, London, School of Advanced Study, University of London DA - 2017/// PY - 2017 L4 - https://archivedweb.blogs.sas.ac.uk/files/2017/06/RESAW2017-JurikZierau-Data_management_of_web_archive_research_data.pdf ER - TY - RPRT TI - Web Archiving at National Libraries Findings of Stakeholders’ Consultation by the Internet Archive AU - Hockx-Yu, Helen AB - Internet Archive conducted a stakeholders’ consultation exercise between November 2015 and March 2016, with the aim of understanding current practices, then reviewing Internet Archive’s current services in this light and exploring new aspects for national libraries. This document reports on the consultation and summarises the findings. DA - 2016/// PY - 2016 SP - 19 PB - Internet Archive ER - TY - CONF TI - Capturing the Web at Large: A Critique of Current Web Referencing Practices AU - Nyvang, Caroline AU - Kromann Hvid, Thomas AU - Zierau, Eld AB - The Internet and the cultural phenomena that exist online are increasingly attracting academic awareness, and e-materials both supplement and replace physical materials. These new opportunities come with a range of challenges. 
Websites are connected in new and unfamiliar ways, the amount of data easily surpasses what we have experienced previously, and we do not yet have an infrastructure that can lend proper support to the increased scholarly use of web resources [1-2]. This paper is an attempt to grapple with one of the core challenges, namely our ability to provide precise and persistent references to web material. The paper charts prevailing ideals and practices regarding web references within the Humanities. We highlight the challenges based on an analysis of web references in two case studies – a selection of Danish master’s theses from 2015 and academic books on contemporary Danish literature. We propose a new best practice that is consistent with good scientific practice in terms of both precision and persistency, which cannot be obtained following the existing standards. C3 - “Researchers, practitioners and their use of the archived web”, London, School of Advanced Study, University of London DA - 2017/// PY - 2017 L4 - https://archivedweb.blogs.sas.ac.uk/files/2017/06/RESAW2017-NyvangKromannZierau-Capturing_the_web_at_large.pdf ER - TY - CONF TI - Can web presence predict academic performance? AU - Gulyás, László AU - Jurányi, Zsolt AU - Soós, Sándor AU - Kampis, George AB - This paper reports the preliminary results of a project that aims at incorporating the analysis of the web presence (content) of research institutions into the scientometric analysis of these research institutions. The problem is to understand and predict the dynamics of academic activity and resource allocation using web presence. The present paper approaches this problem in two parts. First, we develop a crawler and an archive of the web contents obtained from academic institutions, and present an early analysis of the records. Second, we use (currently off-line) records to analyze the dynamics of resource allocation. Combining the two parts is an ambition of ongoing work. 
The motivation in this study is twofold. First, we strongly believe that independent archiving, indexing and searching of (past) web content is an important task, even with regard to academic web presence. We are particularly interested in studying the dynamics of the ”online scientific discourse”, based on the assumption that the changing traces of web presence are an important factor documenting the intensity of activity. Second, we maintain that the trend-analysis of scientific activity represents a hitherto unused potential. We illustrate this by a pilot where, using ’offline’ longitudinal datasets, we study whether past (i.e. cumulative) success can predict current (and future) activity in academia. Or, in short: do institutions invest and publish in areas where they have been successful? The answer to this question is, we believe, important to understanding and predicting research policies and their changes. C1 - New York, New York, USA C3 - Proceedings of the 23rd International Conference on World Wide Web - WWW '14 Companion DA - 2014/// PY - 2014 DO - 10.1145/2567948.2579037 SP - 1183 EP - 1188 PB - ACM Press SN - 978-1-4503-2745-9 UR - http://dl.acm.org/citation.cfm?doid=2567948.2579037 ER - TY - CONF TI - Big is small, and changes slowly in Hungary AU - Kampis, György AU - Gulyás, László AB - The Internet Archive is incomplete and national archives are necessary. We report a pilot study in Hungary, targeting the archiving of the public internet content of academic research institutions, and present some early analysis results, indicating that the internet-based “big data” is unexpectedly small for Hungary, and furthermore that this dataset changes at a low rate. We suggest that differences in the productivity of the institutions can be safely correlated with the differences in content refreshment in their internet presence. 
C3 - Coginfo 2013 Conference DA - 2013/// PY - 2013 ER - TY - CONF TI - Towards a national web in a federated country: a Belgian case study AU - Chambers, Sally AU - Mechant, Peter AU - Vandepontseele, Sophie AU - Isbergue, Nadège AU - Depoortere, Rolande AB - Although the .be domain was introduced in June 1988, the Belgian web is currently not systematically archived. As of August 2016, 1,550,147 domains are registered by DNS Belgium. Without a Belgian web archive, the content of these websites will not be preserved for future generations and a significant portion of Belgian history will be lost forever. In this paper we present the initial findings of a research project exploring the policy, legal, technical and scientific issues around archiving the Belgian web. The aim of this project is to a) identify current best practices in web archiving, b) pilot a Belgian web archive and c) identify research use cases for the scientific study of the Belgian web. This case study is seen as a first step towards implementing a long-term web archiving strategy for Belgium. C3 - National Webs DA - 2016/// PY - 2016 UR - https://biblio.ugent.be/publication/8511255 Y2 - 2017/06/26/ KW - web KW - archiving KW - digital humanities KW - Cultural Sciences ER - TY - CONF TI - First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges AU - KRAGELJ, Matjaž AU - KOVAČIČ, Mitja AB - The National and University Library (NUK) has been archiving the web for almost fifteen years. During the last six years, we have been working at different levels of harvesting. For most of the time, we have dealt with harvesting of selected web sites that might be significant for future generations. The harvesting process runs smoothly, with the exception of some technical difficulties resulting from the use of scripted languages (for instance Ajax, Flash, JavaScript, asynchronous transmissions, real-time streaming protocols, etc.). 
The number of archived web pages keeps growing very fast. We are also very successful in harvesting social media web sites with tools developed at NUK. Being aware that the total amount of web pages is far more extensive than what had been harvested, we decided to start harvesting the Slovenian domain (*.si). The first domain harvesting was successful; however, we realized that much deeper and broader levels should be harvested by using heuristic methods. Our experience showed that the most informative web content is hidden beneath the *.si domain data provided by ARNES (Academic Research Network of Slovenia) and is therefore not accessible. The paper presents the results of the first harvesting iteration of the Slovenian web. Further, on a sample of the first hundred domains, the results of the first and second harvesting iterations will be compared and analysed. Finally, the relevance of the data acquired from the harvested web pages as a complementary data source for the digital library will be presented. C1 - Cape Town C3 - Preservation and Conservation with Information Technology. IFLA 2015 South Africa DA - 2015/// PY - 2015 PB - IFLA -- International Federation of Library Associations and Institutions L4 - http://library.ifla.org/1191/1/090-kragelj-en.pdf KW - web archiving KW - digital library KW - harvesting KW - national domain KW - social networks harvesting ER - TY - CONF TI - A decade of web archiving in the National and University Library in Zagreb AU - Holub, Karolina AU - Rudomino, Ingeborg AB - Due to the dynamic nature of the web, its explosive growth, short lifespan, instability and similar characteristics, archiving it for future generations has become invaluable. 
The National and University Library in Zagreb (Nacionalna i sveučilišna knjižnica u Zagrebu, NSK), as a memory institution responsible for collecting, cataloguing, archiving and providing access to all types of resources, recognized the significance of collecting and storing online content as part of the NSK's core activities. This is supported by a positive legal environment: since 1997, when Croatia passed the Law on Libraries, online publications have been subject to legal deposit. In 2004 NSK established the Croatian Web Archive (Hrvatski arhiv weba, HAW) in collaboration with the University Computing Centre (Srce) and developed a system for capturing and archiving Croatian web resources. From 2004 to 2010 only selective archiving of web resources was conducted according to pre-established selection criteria. Taking into account NSK’s responsibility to preserve resources on Croatian social, scientific and cultural history, the importance of taking a snapshot of all publicly available resources under the national top-level domain (.hr) was recognized in 2011. Since then national domain harvestings have been conducted annually. In addition, in 2011 NSK started to run thematic harvestings of national importance. The paper will present the NSK's ten years’ experience in managing web resources with the emphasis on implementation of the system for selective and domain harvesting as well as the challenges for providing access to archived resources. Also, the harvested data from 2004 to 2014 will be analysed. The findings will illustrate the variability of URLs, frequency of harvesting and types of content. The data from the last four .hr harvestings will also be presented. C1 - Cape Town C3 - Preservation and Conservation with Information Technology. 
IFLA 2015 South Africa DA - 2015/// PY - 2015 PB - IFLA -- International Federation of Library Associations and Institutions L4 - http://library.ifla.org/1092/1/090-holub-en.pdf KW - legal deposit KW - web archiving KW - Croatian Web Archive KW - national domain harvesting KW - selective harvesting KW - thematic harvesting ER - TY - CONF TI - Growing a web archiving program: A case study for evolving an organization-management plan AU - Suomela, Todd AB - Web archiving presents a number of technical and organizational challenges for libraries. The University of Alberta Libraries has been using Archive-It to manage a web archiving program since 2009. This presentation will describe the history of web archiving at the University of Alberta and show the evolution of those services over time. Web archiving is not just technically challenging; it can also be organizationally challenging. Alberta has elected to use a distributed model for collection management by spreading the work for collection development and maintenance across subject librarians and library support staff. Some of the challenges of such a management plan include collection scoping, skill transfer, quality assurance, and metadata creation. The libraries also collaborate with regional and national consortia while working to expand services to researchers and casual users of the library. Attendees will take away lessons about collection management, collaboration, and research services for web archives. C1 - Cape Town C3 - Preservation and Conservation with Information Technology. 
IFLA 2015 South Africa DA - 2015/// PY - 2015 PB - IFLA Y2 - 2017/06/23/ L4 - http://library.ifla.org/1088/1/090-suomela-en.pdf KW - Web archives KW - Distributed collaboration KW - Library Collections Management ER - TY - CONF TI - ArchiveSpark AU - Holzmann, Helge AU - Goel, Vinay AU - Anand, Avishek C1 - New York, New York, USA C3 - Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL '16 DA - 2016/// PY - 2016 DO - 10.1145/2910896.2910902 SP - 83 EP - 92 PB - ACM Press SN - 978-1-4503-4229-2 UR - http://dl.acm.org/citation.cfm?doid=2910896.2910902 KW - Web Archives KW - Big Data KW - Data Extraction ER - TY - JOUR TI - Supporting student research with semantic technologies and digital archives AU - Martínez-García, Agustina AU - Corti, Louise T2 - Technology, Pedagogy and Education DA - 2012/07// PY - 2012 DO - 10.1080/1475939X.2012.704320 VL - 21 IS - 2 SP - 273 EP - 288 SN - 1475-939X UR - http://www.tandfonline.com/doi/abs/10.1080/1475939X.2012.704320 ER - TY - JOUR TI - Linking Objects and their Stories: An API For Exploring Cultural Heritage Using Formal Concept Analysis AU - Eklund, Peter AU - Wray, Tim AU - Ducrou, Jon T2 - Journal of Emerging Technologies in Web Intelligence DA - 2011/08/01/ PY - 2011 DO - 10.4304/jetwi.3.3.239-252 VL - 3 IS - 3 SN - 1798-0461 UR - http://www.jetwi.us/index.php?m=content&c=index&a=show&catid=157&id=883 ER - TY - JOUR TI - MIT's CWSpace project: packaging metadata for archiving educational content in DSpace AU - Reilly, William AU - Wolfe, Robert AU - Smith, MacKenzie T2 - International Journal on Digital Libraries DA - 2006/04/20/ PY - 2006 DO - 10.1007/s00799-005-0131-2 VL - 6 IS - 2 SP - 139 EP - 147 SN - 1432-5012 UR - http://link.springer.com/10.1007/s00799-005-0131-2 ER - TY - CHAP TI - Creating and Consuming Metadata from Transcribed Historical Vital Records for Ingestion in a Long-Term Digital Preservation Platform AU - Grant, Dolores AU - Debruyne, Christophe AU - Grant, Rebecca AU 
- Collins, Sandra T2 - Confederated International Workshops: OTM Academy, OTM Industry Case Studies Program, EI2N, FBM, INBAST, ISDE, META4eS, and MSC 2015 Rhodes, Greece, October 26–30, 2015, Proceedings AB - In the Irish Record Linkage 1864-1913 (IRL) project, digital archivists transcribe digitized register pages containing vital records into a database, which is then used to generate RDF triples. Historians then use those triples to answer some specific research questions on the IRL platform. Though the triples themselves are a highly valuable asset that can be adopted by many, the digitized records and their RDF representations need to be adequately stored and preserved according to best standards and guidelines to ensure they do not get lost over time. This problem had not previously been investigated within the project. This paper reports on the creation of Qualified Dublin Core from those triples for ingestion with the digitized register pages in an adequate long-term digital preservation platform and repository. Rather than creating RDF only for the purpose of this project, we demonstrate how we can distill artifacts from the RDF that are fit for discovery, access, and even reuse via that repository and how we elicit and conserve the knowledge and memories about Ireland, its history and culture contained in those register pages. DA - 2015/// PY - 2015 SP - 445 EP - 450 SN - 978-3-319-26138-6 UR - http://link.springer.com/10.1007/978-3-319-26138-6_47 KW - Linked data KW - Metadata KW - Mapping KW - Vital records ER - TY - JOUR TI - Metadata for a Web Archive: PREMIS and XMP as Tools for the Task. AU - Romaniuk, Laurentia M T2 - Library Philosophy & Practice AB - In a time when websites are ever-changing, what metadata standards and tools are best for ensuring that web archive objects (such as snapshots of websites) are readable for users of the future? Can the evolution of web interfaces be documented? 
Initiatives that explore these questions already exist, such as the Internet Archive's Wayback Machine (which stores source code from websites along with images); however, other archive-building solutions are also available but have yet to be explored. The field of digital asset management (DAM), for example, has long examined how assets (digital files) are stored, organized, retrieved, and preserved. Best practices related to the use of metadata standards and tools found in digital asset management are useful and relevant to web archive building. In order to better understand the practicality of implementing DAM best practices in building a web archive, a small project was performed which involved cross-walking two metadata standards, Adobe's eXtensible Metadata Platform (XMP) and PREservation Metadata: Implementation Strategies (PREMIS), and recording metadata related to snapshots of a website, the Perseus Digital Library, over a span of more than a decade. The findings of this project showed that it is impossible, at least in part, to encode PREMIS within XMP. DA - 2014/02/26/ PY - 2014 SP - 1 EP - 20 SN - 15220222 UR - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=97212804&lang=hu&site=ehost-live L4 - http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=2755&context=libphilprac KW - RESEARCH KW - Web archiving KW - Web archives KW - Digital libraries KW - Web KW - Metadata KW - Archive KW - Crosswalking KW - Digital Asset Management KW - Interface KW - PREMIS KW - Tags (Metadata) KW - XMP ER - TY - CONF TI - WARCreate AU - Kelly, Mat AU - Weigle, Michele C. AB - The Internet Archive's Wayback Machine is the most common way that typical users interact with web archives. The Internet Archive uses the Heritrix web crawler to transform pages on the publicly available web into Web ARChive (WARC) files, which can then be accessed using the Wayback Machine. 
Because Heritrix can only access the publicly available web, many personal pages (e.g. password-protected pages, social media pages) cannot be easily archived into the standard WARC format. We have created a Google Chrome extension, WARCreate, that allows a user to create a WARC file from any webpage. Using this tool, content that might have been otherwise lost in time can be archived in a standard format by any user. This tool provides a way for casual users to easily create archives of personal online content. This is one of the first steps in resolving issues of "long term storage, maintenance, and access of personal digital assets that have emotional, intellectual, and historical value to individuals". C1 - New York, New York, USA C3 - Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries - JCDL '12 DA - 2012/// PY - 2012 DO - 10.1145/2232817.2232930 SP - 437 PB - ACM Press SN - 978-1-4503-1154-0 UR - http://dl.acm.org/citation.cfm?doid=2232817.2232930 ER - TY - CONF TI - Adaptive search systems for web archive research AU - Huurdeman, Hugo C. AB - The wealth of digital information available in our time has become indispensable for a rich variety of tasks. We use data on the Web for work, leisure, and research, aided by various search systems, allowing us to find small needles in giant haystacks. Despite recent advances in personalization and contextualization, however, various types of tasks, ranging from simple lookup tasks to complex, exploratory and analytical ventures, are mainly supported in elementary, “one-size-fits-all” search interfaces. Web archives, keepers of our future cultural heritage, have gathered petabytes of valuable Web data, which characterize our times for future generations. Access to these archives, however, is surprisingly limited: online Web archives usually provide a URL-based Wayback Machine interface, sometimes extended with rudimentary search options. 
As a result of limited access, Web archives have not been widely used for research so far. For emerging research using Web archives, there is a need to move beyond URL-based and simple search access, towards providing support for complex (re)search tasks. In my thesis, I am exploring ways to move beyond the “one-size-fits-all” approach for search systems, and I work on systems which can support the flow of complex search, also in the context of archived Web data. Rich models of search and research can be incorporated into adaptive search systems, supporting search strategies in various stages of complex search tasks. Concretely, I look at the use case of the Humanities researcher, for which the large, Terabyte-scale Web archives can be a valuable addition to existing sources utilized to perform research. C1 - New York, New York, USA C3 - Proceedings of the 5th Information Interaction in Context Symposium on - IIiX '14 DA - 2014/// PY - 2014 DO - 10.1145/2637002.2637063 SP - 354 EP - 356 PB - ACM Press SN - 978-1-4503-2976-7 UR - http://dl.acm.org/citation.cfm?doid=2637002.2637063 ER - TY - CONF TI - Analyzing web archives through topic and event focused sub-collections AU - Gossen, Gerhard AU - Demidova, Elena AU - Risse, Thomas C1 - New York, New York, USA C3 - Proceedings of the 8th ACM Conference on Web Science - WebSci '16 DA - 2016/// PY - 2016 DO - 10.1145/2908131.2908175 SP - 291 EP - 295 PB - ACM Press SN - 978-1-4503-4208-7 UR - http://dl.acm.org/citation.cfm?doid=2908131.2908175 KW - Web archive KW - events KW - sub-collection KW - topics ER - TY - CONF TI - First steps in archiving the mobile web AU - Schneider, Richard AU - McCown, Frank AB - Smartphones and tablets are increasingly used to access the Web, and many websites now provide alternative sites tailored specifically for these mobile devices. Web archivists are in need of tools to aid in archiving this equally ephemeral Mobile Web.
We present Findmobile, a tool for automating the discovery of mobile websites. We tested our tool in an experiment examining 10K popular websites and found that the technique most frequently used by popular websites to direct mobile users to mobile sites was automated client- and server-side redirection. We found that nearly half of mobile web pages differ dramatically from their stationary web counterparts and that the most popular websites are those most likely to have mobile-specific pages. C1 - New York, New York, USA C3 - Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL '13 DA - 2013/// PY - 2013 DO - 10.1145/2467696.2467735 SP - 53 EP - 56 PB - ACM Press SN - 978-1-4503-2077-1 UR - http://dl.acm.org/citation.cfm?doid=2467696.2467735 KW - web archiving KW - web crawling KW - mobile web ER - TY - CONF TI - Only One Out of Five Archived Web Pages Existed as Presented AU - Ainsworth, Scott G. AU - Nelson, Michael L. AU - Van de Sompel, Herbert AB - When a user retrieves a page from a web archive, the page is marked with the acquisition datetime of the root resource, which effectively asserts "this is how the page looked at that datetime." However, embedded resources, such as images, are often archived at different datetimes than the main page. The presentation appears temporally coherent, but is composed from resources acquired over a wide range of datetimes. We examine the completeness and temporal coherence of composite archived resources (composite mementos) under two selection heuristics. The completeness and temporal coherence achieved using a single archive was compared to the results achieved using multiple archives. We found that at most 38.7% of composite mementos are temporally coherent, and that at most 17.9% (roughly 1 in 5) are both temporally coherent and 100% complete. Using multiple archives increases mean completeness by 3.1-4.1% but also reduces temporal coherence.
C1 - New York, New York, USA C3 - Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT '15 DA - 2015/// PY - 2015 DO - 10.1145/2700171.2791044 SP - 257 EP - 266 PB - ACM Press SN - 978-1-4503-3395-5 UR - http://dl.acm.org/citation.cfm?doid=2700171.2791044 ER - TY - CONF TI - Content Selection and Curation for Web Archiving AU - Milligan, Ian AU - Ruest, Nick AU - Lin, Jimmy C1 - New York, New York, USA C3 - Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL '16 DA - 2016/// PY - 2016 DO - 10.1145/2910896.2910913 SP - 107 EP - 110 PB - ACM Press SN - 978-1-4503-4229-2 UR - http://dl.acm.org/citation.cfm?doid=2910896.2910913 L1 - https://cc.au.dk/fileadmin/user_upload/WARCnet/Milligan_You_shouldn_t_Need_to_be__2_.pdf ER - TY - CONF TI - On Identifying the Bounds of an Internet Resource AU - Poursardar, Faryaneh AU - Shipman, Frank AB - Systems for retrieving or archiving Internet resources often assume a URI acts as a delimiter for the resource. But there are many situations where Internet resources do not have a one-to-one mapping with URIs. For URIs that point to the first page of a document that has been broken up over multiple pages, users are likely to consider the whole article as the resource, even though it is spread across multiple URIs. Comments, tags, ratings, and advertising might or might not be perceived as part of the resource whether they are retrieved as part of the primary URI or accessed via a link. Similarly, whether content accessible via links, tabs, or other navigation available at the primary URI is perceived as part of the resource may depend on the design of the website. We are examining what people believe are the bounds of Internet resources with the hope of informing systems that better match user perceptions. To understand this challenge we explore a situation where the user is assumed to have identified a resource by a URI, particularly for archiving.
To begin to answer these questions, we asked 110 participants how desirable it would be for web content related to an identified archived resource to also be archived. Results indicate that the features important to this decision likely vary considerably from resource to resource. C1 - New York, New York, USA C3 - Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval - CHIIR '16 DA - 2016/// PY - 2016 DO - 10.1145/2854946.2854982 SP - 305 EP - 308 PB - ACM Press SN - 978-1-4503-3751-9 UR - http://dl.acm.org/citation.cfm?doid=2854946.2854982 ER - TY - CONF TI - On the Applicability of Delicious for Temporal Search on Web Archives AU - Holzmann, Helge AU - Nejdl, Wolfgang AU - Anand, Avishek C1 - New York, New York, USA C3 - Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval - SIGIR '16 DA - 2016/// PY - 2016 DO - 10.1145/2911451.2914724 SP - 929 EP - 932 PB - ACM Press SN - 978-1-4503-4069-4 UR - http://dl.acm.org/citation.cfm?doid=2911451.2914724 ER - TY - CONF TI - ArcLink AU - AlSum, Ahmed AU - Nelson, Michael L. C1 - New York, New York, USA C3 - Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL '13 DA - 2013/// PY - 2013 DO - 10.1145/2467696.2467751 SP - 377 EP - 378 PB - ACM Press SN - 978-1-4503-2077-1 UR - http://dl.acm.org/citation.cfm?doid=2467696.2467751 KW - Design KW - Experimentation ER - TY - CONF TI - Big Data Processing of School Shooting Archives AU - Farag, Mohamed AU - Nakate, Pranav AU - Fox, Edward A. AB - Web archives about school shootings consist of webpages that may or may not be relevant to the events of interest. There are three main goals of this work: the first is to clean the webpages, which involves removing stop words and non-relevant parts of a webpage. The second goal is to select only the webpages relevant to the events of interest.
The third goal is to upload the cleaned and relevant webpages to Apache Solr so that they are easily accessible. We show the details of all the steps required to achieve these goals. The results show that representative Web archives are noisy, with 2%-40% relevant content. By cleaning the archives, we help researchers focus on relevant content for their analysis. C1 - New York, New York, USA C3 - Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL '16 DA - 2016/// PY - 2016 DO - 10.1145/2910896.2925466 SP - 271 EP - 272 PB - ACM Press SN - 978-1-4503-4229-2 UR - http://dl.acm.org/citation.cfm?doid=2910896.2925466 KW - Big Data Processing KW - Classification KW - Digital Libraries KW - Web Archives ER - TY - CONF TI - Tempas AU - Holzmann, Helge AU - Anand, Avishek C1 - New York, New York, USA C3 - Proceedings of the 25th International Conference Companion on World Wide Web - WWW '16 Companion DA - 2016/// PY - 2016 DO - 10.1145/2872518.2890555 SP - 207 EP - 210 PB - ACM Press SN - 978-1-4503-4144-8 UR - http://dl.acm.org/citation.cfm?doid=2872518.2890555 ER - TY - CONF TI - Mobile Mink AU - Jordan, Wesley AU - Kelly, Mat AU - Brunelle, Justin F. AU - Vobrak, Laura AU - Weigle, Michele C. AU - Nelson, Michael L. AB - We describe the mobile app Mobile Mink, which extends Mink, a browser extension that integrates the live and archived web. Mobile Mink discovers mobile and desktop URIs and provides the user an aggregated TimeMap of both mobile and desktop mementos. Mobile Mink also allows users to submit mobile and desktop URIs for archiving at the Internet Archive and Archive.today. Mobile Mink helps to increase the archival coverage of the growing mobile web.
C1 - New York, New York, USA C3 - Proceedings of the 15th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL '15 DA - 2015/// PY - 2015 DO - 10.1145/2756406.2756956 SP - 243 EP - 244 PB - ACM Press SN - 978-1-4503-3594-2 UR - http://dl.acm.org/citation.cfm?doid=2756406.2756956 ER - TY - JOUR TI - Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. AU - Jones, Shawn M AU - Van de Sompel, Herbert AU - Shankar, Harihar AU - Klein, Martin AU - Tobin, Richard AU - Grover, Claire T2 - PLoS ONE AB - Increasingly, scholarly articles contain URI references to “web at large” resources including project web sites, scholarly wikis, ontologies, online debates, presentations, blogs, and videos. Authors reference such resources to provide essential context for the research they report on. A reader who visits a web at large resource by following a URI reference in an article, some time after its publication, is led to believe that the resource’s content is representative of what the author originally referenced. However, due to the dynamic nature of the web, that may very well not be the case. We reuse a dataset from a previous study in which several authors of this paper were involved, and investigate to what extent the textual content of web at large resources referenced in a vast collection of Science, Technology, and Medicine (STM) articles published between 1997 and 2012 has remained stable since the publication of the referencing article. We do so in a two-step approach that relies on various well-established similarity measures to compare textual content. In a first step, we use 19 web archives to find snapshots of referenced web at large resources that have textual content that is representative of the state of the resource around the time of publication of the referencing paper. We find that representative snapshots exist for about 30% of all URI references.
In a second step, we compare the textual content of representative snapshots with that of their live web counterparts. We find that for over 75% of references the content has drifted away from what it was when referenced. These results raise significant concerns regarding the long-term integrity of the web-based scholarly record and call for the deployment of techniques to combat these problems. [ABSTRACT FROM AUTHOR] DA - 2016/12/02/ PY - 2016 VL - 11 IS - 12 SP - 1 EP - 32 SN - 19326203 UR - https://doi.org/10.1371/journal.pone.0167475 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Research and analysis methods KW - Archives KW - Crawling KW - WEBSITES KW - Internet KW - Algorithms KW - UNIFORM Resource Identifiers KW - BLOGS KW - Computer and information sciences KW - Computer networks KW - Data management KW - Information centers KW - Ontologies KW - ONTOLOGIES (Information retrieval) KW - Research Article KW - WIKIS (Computer science) ER - TY - JOUR TI - Self-Indexing RDF Archives. AU - Cerdeira-Pena, Ana AU - Farina, Antonio AU - Fernandez, Javier D AU - Martinez-Prieto, Miguel A T2 - 2016 Data Compression Conference (DCC) AB - Although Big RDF management is an emerging topic in the so-called Web of Data, existing techniques disregard the dynamic nature of RDF data. These RDF archives evolve over time and need to be preserved and queried across time. This paper presents v-RDFCSA, an RDF archiving solution that extends RDFCSA (an RDF self-index) to provide version-based queries on top of compressed RDF archives.
Our experiments show that v-RDFCSA reduces space requirements by up to 35-60 times over a state-of-the-art baseline, and outperforms it by more than an order of magnitude in query resolution. DA - 2016/01// PY - 2016 SP - 526 EP - 535 SN - 9781509018536 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://ieeexplore.ieee.org/abstract/document/7786197/ ER - TY - CHAP TI - How to Search the Internet Archive Without Indexing It AU - Kanhabua, Nattiya AU - Kemkes, Philipp AU - Nejdl, Wolfgang AU - Nguyen, Tu Ngoc AU - Reis, Felipe AU - Tran, Nam Khanh T2 - Research & Advanced Technology for Digital Libraries: 20th International Conference on Theory & Practice of Digital Libraries, TPDL 2016, Hannover, Germany, September 5-9, 2016, Proceedings AB - Significant parts of cultural heritage have been produced on the web during the last decades. While easy accessibility to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and the lack of usage logs that contain the implicit human feedback most relevant for today’s web search. In this paper, we propose an entity-oriented search system to support retrieval and analytics on the Internet Archive. We use Bing to retrieve a ranked list of results from the current web. In addition, we link retrieved results to the WayBack Machine; thus allowing keyword search on the Internet Archive without processing and indexing its raw archived content. Our search system complements existing web archive search tools through a user-friendly interface, which comes close to the functionalities of modern web search engines (e.g., keyword search, query auto-completion and related query suggestion), and provides the great benefit of taking user feedback on the current web into account also for web archive search.
Through extensive experiments, we conduct quantitative and qualitative analyses in order to provide insights that enable further research on, and practical applications of, web archives. DA - 2016/01// PY - 2016 SP - 147 EP - 160 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://link.springer.com/chapter/10.1007/978-3-319-43997-6_12 L4 - http://link.springer.com/10.1007/978-3-319-43997-6_12 L4 - https://arxiv.org/pdf/1701.08256.pd ER - TY - JOUR TI - Archiving Software Surrogates on the Web for Future Reference. AU - Holzmann, Helge AU - Sperber, Wolfram AU - Runnwerth, Mila T2 - Research & Advanced Technology for Digital Libraries: 20th International Conference on Theory & Practice of Digital Libraries, TPDL 2016, Hannover, Germany, September 5-9, 2016, Proceedings AB - Software has long been established as an essential aspect of the scientific process in mathematics and other disciplines. However, reliably referencing software in scientific publications is still challenging for various reasons. A crucial factor is that software dynamics with temporal versions or states are difficult to capture over time. We propose to archive and reference surrogates instead, which can be found on the Web and reflect the actual software to a remarkable extent. Our study shows that about half of the webpages of software are already archived, with almost all of them including some kind of documentation. DA - 2016/01// PY - 2016 SP - 215 SN - 9783319439969 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L1 - https://arxiv.org/pdf/1702.01163 L4 - http://link.springer.com/chapter/10.1007/978-3-319-43997-6_17 KW - Web Archives KW - Analysis KW - Scientific Software Management ER - TY - JOUR TI - What does the Web remember of its deleted past? An archival reconstruction of the former Yugoslav top-level domain.
AU - Ben-David, Anat T2 - New Media & Society AB - This article argues that the use of the Web as a primary source for studying the history of nations is conditioned by the structural ties between sovereignty and the Internet protocol, and by a temporal proximity between live and archived websites. The argument is illustrated by an empirical reconstruction of the history of the top-level domain of Yugoslavia (.yu), which was deleted from the Internet in 2010. The archival discovery method used four lists of historical .yu Uniform Resource Locators (URLs) that were captured from the live Web before the domain was deleted, and an automated hyperlink discovery script that retrieved their snapshots from the Internet Archive and reconstructed their immediate hyperlinked environment in a network. Although a considerable portion of the historical .yu domain was found on the Internet Archive, the reconstructed space was predominantly Serbian. [ABSTRACT FROM AUTHOR] DA - 2016/08// PY - 2016 VL - 18 IS - 7 SP - 1103 SN - 14614448 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://nms.sagepub.com/content/early/2016/04/27/1461444816643790.abstract KW - Web history KW - Internet Archive KW - Web archives KW - WEB archives KW - Yugoslavia KW - Wayback Machine KW - WEBSITES KW - Country code top-level domain KW - digital heritage KW - HYPERLINKS KW - INTERNET Archive (Firm) KW - Internet Corporation for Assigned Names and Number KW - INTERNET protocols KW - national Webs KW - Serbia KW - SOVEREIGNTY (Political science) KW - UNIFORM Resource Locators ER - TY - JOUR TI - The Design of a Cloud-Based Website Parallel Archiving System. AU - Chao, David AU - Gill, Sam T2 - Issues in Information Systems AB - Many business applications are designed and organized to support business activities for a period of time and to be renewed at the turn of the period. 
Design changes are typically implemented in a revision of the application that supports future periods to assure smooth operation. Very often the applications supporting the previous periods need to remain operational even after the application for the new period has started. Parallel operation of current and previous periods' applications may be problematic for web-based applications due to the rapid change in Internet technologies. Cloud computing provides a solution to this problem with the capability of offering virtual servers with user-specified configurations. This paper proposes a parallel archiving scheme that uses a virtual server to run each period's application in a cloud platform, so that previous periods' applications run in parallel with the current period's system and form an easy-to-access archive for historical data. [ABSTRACT FROM AUTHOR] DA - 2015/01// PY - 2015 VL - 16 IS - 1 SP - 226 SN - 15297314 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://www.iacis.org/iis/2015/1_iis_2015_226-231.pdf KW - WEB archiving KW - Cloud Computing KW - CLOUD computing KW - VIRTUAL machine systems KW - Virtualization KW - Website Archiving ER - TY - JOUR TI - Agent-based Approach to WEB Exploration Process AU - Opalinski, Andrzej AU - Nawarecki, Edward AU - Kluska-Nawarecka, Stanislawa T2 - Procedia Computer Science AB - The paper presents the concept of an agent-based system for searching and monitoring Web pages. It is oriented toward the exploration of a limited problem area, covering a given sector of industry or economy. The agent-based (modular) structure of the system is proposed in order to ease the introduction of modifications or the enrichment of its functionality. Commonly used search engines do not offer such a feature.
The second part of the article presents a pilot version of the WEB mining system, representing a simplified implementation of the previously presented concept. Testing of the implemented application was carried out in the problem area of the foundry industry. DA - 2015/01/01/ PY - 2015 DO - 10.1016/j.procs.2015.05.263 VL - 51 IS - International Conference On Computational Science, ICCS 2015 SP - 1052 EP - 1061 SN - 1877-0509 UR - https://doi.org/10.1016/j.procs.2015.05.263 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds ER - TY - JOUR TI - Into the Dark Domain: The UK Web Archive as a Source for the Contemporary History of Public Health. AU - Gorsky, Martin T2 - Social History of Medicine AB - With the migration of the written record from paper to digital format, archivists and historians must urgently consider how web content should be conserved, retrieved and analysed. The British Library has recently acquired a large number of UK domain websites, captured 1996-2010, which is colloquially termed the Dark Domain Archive while technical issues surrounding user access are resolved. This article reports the results of an invited pilot project that explores methodological issues surrounding use of this archive. It asks how the relationship between UK public health and local government was represented on the web, drawing on the 'declinist' historiography to frame its questions. It points up some difficulties in developing an aggregate picture of web content due to duplication of sites. It also highlights their potential for thematic and discourse analysis, using both text and image, illustrated through an argument about the contradictory rationale for public health policy under New Labour.
[ABSTRACT FROM AUTHOR] DA - 2015/08// PY - 2015 VL - 28 IS - 3 SP - 596 SN - 0951631X UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://shm.oxfordjournals.org/content/28/3/596.full KW - methodology KW - websites KW - BRITISH Library KW - HISTORY -- Sources -- Computer network resources KW - INTERNET -- History KW - local government KW - public health KW - PUBLIC health -- Computer network resources KW - PUBLIC health -- History KW - WEBSITES -- History ER - TY - JOUR TI - Building a Future for Our Digital Memory: A Collaborative Infrastructure for Permanent Access to Digital Heritage in The Netherlands AU - Ras, Marcel AU - Sierman, Barbara T2 - New Review of Information Networking AB - This article describes the developments in The Netherlands to establish a national Network for Digital Heritage. This network is based on three pillars: to make the digital heritage visible, usable, and sustainably preserved. Three working programs will have their own but integrated set of dedicated actions in order to create a national infrastructure in The Netherlands, based on an optimal use of existing facilities. In this article the focus is on the activities related to the sustainable preservation of the Dutch national digital heritage. DA - 2015/07/03/ PY - 2015 DO - 10.1080/13614576.2015.1114828 VL - 20 IS - 1-2 SP - 219 EP - 228 SN - 1361-4576 UR - http://www.tandfonline.com/doi/full/10.1080/13614576.2015.1114828 ER - TY - JOUR TI - Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts. 
AU - Samar, Thaer AU - Traub, Myriam C AU - van Ossenbruggen, Jacco AU - de Vries, Arjen P T2 - Research & Advanced Technology for Digital Libraries: 20th International Conference on Theory & Practice of Digital Libraries, TPDL 2016, Hannover, Germany, September 5-9, 2016, Proceedings DA - 2016/01// PY - 2016 SP - 133 SN - 9783319439969 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://link.springer.com/chapter/10.1007/978-3-319-43997-6_11 ER - TY - CHAP TI - ArchiveWeb: Collaboratively Extending and Exploring Web Archive Collections AU - Fernando, Zeon Trevor AU - Marenzi, Ivana AU - Nejdl, Wolfgang AU - Kalyani, Rishita T2 - Research & Advanced Technology for Digital Libraries: 20th International Conference on Theory & Practice of Digital Libraries, TPDL 2016, Hannover, Germany, September 5-9, 2016, Proceedings DA - 2016/01// PY - 2016 SP - 107 EP - 118 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://link.springer.com/10.1007/978-3-319-43997-6_9 L4 - https://arxiv.org/pdf/1702.00198.pdf ER - TY - CHAP TI - Digital Preservation Metadata Practice for Web Archives. AU - Oury, Clément AU - Blumenthal, Karl-Rainer AU - Peyrard, Sébastien T2 - Digital Preservation Metadata for Practitioners DA - 2016/// PY - 2016 SP - 59 EP - 82 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://link.springer.com/content/pdf/10.1007/978-3-319-43763-7.pdf#page=68 ER - TY - JOUR TI - The Cobweb AU - LEPORE, JILL T2 - New Yorker AB - The article discusses efforts to archive historic Internet content, highlighting the Internet Archive nonprofit library in California. 
Topics addressed include the views and career history of Internet Archive founder Brewster Kahle, the legal aspects of the archive in terms of legal-deposit laws and copyright, and the Internet Archive's operations out of an old church building. DA - 2015/01/26/ PY - 2015 VL - 90 IS - 45 SP - 34 EP - 41 SN - 0028792X UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://www.c-i-r-c-u-l-a-t-i-o-n.org/MEDIA/PDF/The-Cobweb.pdf KW - WEB archiving KW - WEB archives KW - INTERNET Archive (Firm) KW - KAHLE, Brewster KW - LEGAL deposit of books, etc. ER - TY - JOUR TI - Webarchivierung an der Bayerischen Staatsbibliothek. (German) AU - Beinert, Tobias T2 - Web archiving at the Bayerische Staatsbibliothek. (English) AB - The Bayerische Staatsbibliothek has been collecting and archiving websites dealing with regional studies and science since 2010. The article provides a survey of the collection and archiving profiles of the Bayerische Staatsbibliothek concerning websites, the legal basis, the workflow that has been developed, as well as the registration of websites and how they are made available in the archives. Finally, further perspectives for the future are presented. (English) [ABSTRACT FROM AUTHOR] DA - 2017/06// PY - 2017 VL - 51 IS - 6 SP - 490 SN - 00061972 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://www.degruyter.com/view/j/bd.2017.51.issue-6/bd-2017-0052/bd-2017-0052.xml KW - web archiving KW - Bavaria KW - Bayern KW - Langzeitarchivierung KW - long-term archiving KW - Webarchivierung ER - TY - JOUR TI - Web-Archivierung an der Saarländischen Universitäts- und Landesbibliothek (SULB). (German) AU - Dupuis, Caroline T2 - Web archiving at the Saarland University and State Library (Saarländische Universitäts- und Landesbibliothek, SULB).
(English) AB - Since 2008, websites have been archived at the Saarland University and State Library (SULB). The repository SaarDok is available for archiving electronic documents concerning regional studies. A legal basis for depositing non-physical works has existed since December 2015. (English) [ABSTRACT FROM AUTHOR] DA - 2017/06// PY - 2017 VL - 51 IS - 6 SP - 529 SN - 00061972 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://www.degruyter.com/view/j/bd.2017.51.issue-6/bd-2017-0055/bd-2017-0055.xml KW - Langzeitarchivierung KW - long time archiving KW - SaarDok KW - Saarland Media Law KW - Saarländisches Mediengesetz KW - Webseiten KW - websites ER - TY - CONF TI - The Dawn of Today's Popular Domains AU - Holzmann, Helge AU - Nejdl, Wolfgang AU - Anand, Avishek AB - The Web has been around and maturing for 25 years. The popular websites of today have undergone vast changes during this period, with a few being there almost since the beginning and many new ones becoming popular over the years. This makes it worthwhile to take a look at how these sites have evolved and what they might tell us about the future of the Web. We therefore embarked on a longitudinal study spanning almost the whole period of the Web, based on data collected by the Internet Archive starting in 1996, to retrospectively analyze how the popular Web as of now has evolved over the past 18 years. For our study we focused on the German Web, specifically on the top 100 most popular websites in 17 categories. This paper presents a selection of the most interesting findings in terms of volume, size as well as age of the Web. While related work in the field of Web Dynamics has mainly focused on change rates and analyzed datasets spanning less than a year, we looked at the evolution of websites over 18 years.
We found that around 70% of the pages we investigated are younger than a year, with an observed exponential growth in age as well as in size up to now. If this growth rate continues, the number of pages from the popular domains will almost double in the next two years. In addition, we give insights into our data set, provided by the Internet Archive, which hosts the largest and most complete Web archive as of today. C1 - New York, New York, USA C3 - Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL '16 DA - 2016/// PY - 2016 DO - 10.1145/2910896.2910901 SP - 73 EP - 82 PB - ACM Press SN - 978-1-4503-4229-2 UR - http://dl.acm.org/citation.cfm?doid=2910896.2910901 ER - TY - JOUR TI - Zum Stand der Webarchivierung in Baden-Württemberg. (German) AU - Geisler, Felix AU - Dannehl, Wiebke AU - Keitel, Christian AU - Wolf, Stefan T2 - Web archiving - the present situation in Baden-Württemberg. (English) AB - The article describes the present situation of web archiving at the level of the Land of Baden-Württemberg. The essential legal basis for collecting and archiving websites and related single documents exists. At the same time, the selection of contents and technical realization of the offer is an extensive task that is shared among several state institutions.
(English) [ABSTRACT FROM AUTHOR] DA - 2017/06// PY - 2017 VL - 51 IS - 6 SP - 481 SN - 00061972 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://www.degruyter.com/view/j/bd.2017.51.issue-6/bd-2017-0051/bd-2017-0051.xml KW - web archiving KW - digital archives KW - Webarchivierung KW - Baden-Württemberg KW - digital cultural assets KW - Digitales Archiv KW - Digitales Kulturgut KW - electronic deposit copy KW - Elektronisches Pflichtexemplar KW - Forschungsdaten KW - research data ER - TY - BOOK TI - Semantic recommender system for the recovery of the preserved web heritage AU - Portilla, Omar AU - Aguilar, José AU - León, Claudia AB - This paper presents a prototype of a semantic personalized recommender system for a repository of preserved web files. To do this, we design and implement a semantic repository of preserved web files, containing metadata associated with each preserved site. The knowledge stored in the metadata of the semantic repository is used by the recommender system in order to give prioritized recommendations of the different preserved web files (or web heritage) that meet certain search criteria. The proposed recommender also considers semantic associations, in order to recommend not only the websites matched to the search criteria, but also semantically related ones.
DA - 2015/// PY - 2015 SP - 1 PB - IEEE SN - 978-1-4673-9143-6 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://ieeexplore.ieee.org/abstract/document/7359467/ L4 - http://eventos.spc.org.pe/clei2015/pdfs/144278.pdf KW - web archiving KW - Communication KW - Computing and Processing KW - Networking and Broadcast Technologies KW - Robotics and Control Systems KW - Signal Processing and Analysis KW - Metadata KW - Engineering Profession KW - Knowledge engineering KW - patrimony web KW - Prototypes KW - Recommender systems KW - Semantic recommender KW - Semantics KW - web service KW - Web services ER - TY - JOUR TI - Webarchivierung in der SUB Hamburg: kleine Schritte in der Region - Bausteine zu einem größeren Ganzen? (German) AU - Hagenah, Ulrich T2 - Web archiving at the Hamburg State and University Library (SUB): small steps in the region - components of a larger whole? (English) AB - Since 2015, the Hamburg State and University Library has been collecting websites of Hamburg institutions or websites concerning regional topics as part of its responsibility as state and depository library. The article describes the legal and technical basis, selection principles, the processing workflow including cataloging as well as how to access the websites' archive copies. Technical and organizational limits and the status of web archiving are discussed in the state library context. 
(English) [ABSTRACT FROM AUTHOR] DA - 2017/06// PY - 2017 VL - 51 IS - 6 SP - 500 SN - 00061972 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://www.degruyter.com/view/j/bd.2017.51.issue-6/bd-2017-0053/bd-2017-0053.xml KW - cultural heritage KW - Archivierung KW - archiving KW - deposit copy KW - Hamburg KW - Kulturelles Erbe KW - Landesbibliothek KW - Landeskunde KW - Pflichtexemplar KW - regional studies KW - state library KW - Website ER - TY - JOUR TI - When time meets information retrieval: Past proposals, current plans and future trends. AU - Moulahi, Bilel AU - Tamine, Lynda AU - Yahia, Sadok Ben T2 - Journal of Information Science AB - With the advent of Web search and the large amount of data published on the Web sphere, a tremendous number of documents have become strongly time-dependent. In this respect, the time dimension has been extensively exploited as a highly important relevance criterion to improve the retrieval effectiveness of document ranking models. Thus, a compelling research interest has arisen in the temporal information retrieval realm, which gives rise to several temporal search applications. In this article, we intend to provide a scrutinizing overview of time-aware information retrieval models. We specifically put the focus on the use of timeliness and its impact on the global value of relevance as well as on the retrieval effectiveness. First, we attempt to motivate the importance of temporal signals, whenever combined with other relevance features, in accounting for document relevance. Then, we review the relevant studies standing at the crossroads of both information retrieval and time according to three common information retrieval aspects: the query level, the document content level and the document ranking model level.
We organize the related temporal-based approaches around specific information retrieval tasks and regarding the task at hand, we emphasize the importance of results presentation and particularly timelines to the end user. We also report a set of relevant research trends and avenues that can be explored in the future. [ABSTRACT FROM AUTHOR] DA - 2016/12// PY - 2016 VL - 42 IS - 6 SP - 725 SN - 01655515 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://journals.sagepub.com/doi/abs/10.1177/0165551515607277 KW - INFORMATION retrieval KW - INTERNET searching KW - QUERYING (Computer science) KW - RANKING KW - Relevance KW - temporal queries KW - temporal ranking KW - time KW - timelines KW - WEB search engines ER - TY - JOUR TI - The End of the Television Archive as We Know It? The National Archive as an Agent of Historical Knowledge in the Convergence Era AU - Hagedoorn, Berber AU - Agterberg, Bas T2 - Media and Communication, Vol 4, Iss 3, Pp 162-175 (2016) VO - 4 AB - Professionals in the television industry are working towards a certain future—rather than end—for the medium based on multi-platform storytelling, as well as multiple screens, distribution channels and streaming platforms. They do so rooted in institutional frameworks where traditional conceptualizations of television still persist. In this context, we reflect on the role of the national television archive as an agent of historical knowledge in the convergence era. Contextualisation and infrastructure function as important preconditions for users of archives to find their way through the enormous amounts of audio-visual material. Specifically, we consider the case of the Netherlands Institute for Sound and Vision, taking a critical stance towards the archive’s practices of contextualisation and preservation of audio-visual footage in the convergence era. 
To do so, this article considers the impact of online circulation, contextualisation and preservation of audio-visual materials in relation to, first, how media policy complicates the re-use of material, and second, the archive’s use by television professionals and media researchers. This article reflects on the possibilities for and benefits of systematic archiving, developments in web archiving, and accessibility of production and contextual documentation of public broadcasters in the Netherlands. We do so based on an analysis of internal documentation, best practices of archive-based history programmes and their related cross-media practices, as well as media policy documentation. We consider how audio-visual archives should deal with the shift towards multi-platform productions, and argue for both a more systematic archiving of production and contextual documentation in the Netherlands, and for media researchers who draw upon archival resources to show a greater awareness of an archive’s history. In the digital age, even more people are part of the archive’s processes of selection and aggregation, affecting how the past is preserved through audio-visual images. DA - 2016/// PY - 2016 DO - 10.17645/mac.v4i3.595 IS - 3 SP - 162 SN - 2183-2439 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://www.cogitatiopress.com/ojs/index.php/mediaandcommunication/article/view/595 KW - archival footage KW - broadcasting KW - Communication. Mass media KW - convergence KW - cross-media KW - digital media KW - history programming KW - media policy KW - online circulation KW - preservation and contextualization practices KW - production research documentation ER - TY - JOUR TI - The World Wide Web as Complex Data Set: Expanding the Digital Humanities into the Twentieth Century and Beyond through Internet Research. 
AU - Black, Michael L T2 - International Journal of Humanities & Arts Computing: A Journal of Digital Humanities AB - While intellectual property protections effectively frame digital humanities text mining as a field primarily for the study of the nineteenth century, the Internet offers an intriguing object of study for humanists working in later periods. As a complex data source, the World Wide Web presents its own methodological challenges for digital humanists, but lessons learned from projects studying large nineteenth century corpora offer helpful starting points. Complicating matters further, legal and ethical questions surrounding web scraping, or the practice of large-scale data retrieval over the Internet, will require humanists to frame their research to distinguish it from commercial and malicious activities. This essay reviews relevant research in the digital humanities and new media studies in order to show how web scraping might contribute to humanities research questions. In addition to recommendations for addressing the complex concerns surrounding web scraping, this essay also provides a basic overview of the process and some recommendations for resources. [ABSTRACT FROM AUTHOR] DA - 2016/03// PY - 2016 VL - 10 IS - 1 SP - 95 EP - 109 SN - 17538548 UR - https://doi.org/10.3366/ijhac.2016.0162 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - RESEARCH KW - WORLD Wide Web KW - INFORMATION retrieval KW - webscraping KW - 20th century KW - DATA mining KW - intellectual property KW - text mining KW - TEXT mining (Information retrieval) ER - TY - JOUR TI - Vědecké využití dat z webových archivů. AU - Kvasnica, Jaroslav AU - Rudišinová, Barbora AU - Kreibich, Rudolf T2 - Research use of web archived data. AB - A major part of our communication and media production has moved from traditional print media into the digital universe.
Digital content on the web is diverse and fluid; it emerges, changes and disappears every day. Such content is unique and valuable from an academic perspective, but as it disappears over time, we are losing the ability to study recent history. Web archives are now taking the responsibility to capture and preserve such content for future research. Web archives preserve vast amounts of data captured over the years and one of the main goals now is to improve the research usability of their collections. This article describes the way web archives store web content and related metadata and summarizes several recent studies that have dealt with research requirements for web archived data. Based on the conclusions of these studies, it suggests further actions to establish cooperation with the research community. (English) [ABSTRACT FROM AUTHOR] DA - 2016/12// PY - 2016 VL - 27 IS - 2 SP - 24 EP - 34 SN - 18013252 UR - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=120430576&lang=hu&site=ehost-live L1 - http://knihovnarevue.nkp.cz/archiv/dokumenty/2016-2/Kvasnica.pdf KW - big data KW - web archiving KW - metadata KW - WARC KW - analýza dat KW - badatelé KW - data analysis KW - researchers KW - velká data KW - webová archivace ER - TY - JOUR TI - Arhivarea Paginilor Web - Initiative Relevante de Pastrare a Patrimoniului Digital European AU - Boruna, Adriana Elena AU - Rahme, Nicoleta T2 - Biblioteca Nationala a Romaniei. Informare si Documentare DA - 2011/// PY - 2011 VL - 4 SP - 39 EP - 52 LA - Romanian SN - 20651058 UR - https://search.proquest.com/docview/1443688144?accountid=27464 KW - Sciences: Comprehensive Works ER - TY - JOUR TI - Web Archiving in the National and University Library AU - Kavcic-Colic, Alenka AU - Klasinc, Janko T2 - Knjiznica AB - The National and University Library (NUK) of Slovenia has been investigating web archiving methods and techniques since 2001.
Under the new Legal Deposit Law adopted in 2006, NUK is the institution responsible for harvesting and archiving the Slovenian web. In 2008 NUK started archiving the Slovenian web by making use of the web harvesting and access tools developed by the International Internet Preservation Consortium (IIPC). The paper presents the complexity of web harvesting and gives an overview of the international practice and NUK's cooperation in the IIPC consortium. Special attention is given to the analysis of public sector web content, harvested since 2008. The main goals of future development of the web archive are an increase in the number of harvested Slovenian websites, the development of a user interface for public access and the development of improved methods for harvesting technically problematic content. Adapted from the source document. DA - 2011/// PY - 2011 VL - 55 IS - 1 SP - 209 EP - 232 LA - Slovene SN - 0023-2424 UR - https://search.proquest.com/docview/1266143501?accountid=27464 L4 - https://knjiznica.zbds-zveza.si/knjiznica/article/download/6011/5658 KW - Web archiving KW - Web archives KW - National libraries KW - Legal deposit KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - University libraries KW - Slovenia KW - National and University Library KW - Public sector ER - TY - JOUR TI - Observations on the development of non-print legal deposit in the UK AU - Gibby, Richard AU - Brazier, Caroline T2 - Library Review AB - Purpose - The process of developing and implementing UK legislation for the legal deposit of electronic and other non-print publications has been lengthy and remains incomplete, although the Government has consulted on draft regulations for implementation in 2013. The purpose of this paper is to provide a short account of progress and review the experience, analysing several factors that have influenced the legislative process and helped shape the proposed regulations.
It summarises the regulatory and non-regulatory steps taken by the UK legal deposit libraries to address the legitimate concerns of publishers and describes some of the practical implications of implementing legal deposit for non-print publications. Design/methodology/approach - The paper draws upon the personal experiences of the authors, who have been directly involved in the legislative process and negotiations with publishers and other stakeholders. Findings - The paper provides new information and a summary of key issues and outcomes, with explanations and some insights into the factors that have influenced them. Originality/value - This paper provides new information about the development of legal deposit in the UK and a review of the issues that have affected its progress. DA - 2012/// PY - 2012 DO - 10.1108/00242531211280487 VL - 61 IS - 5 SP - 362 EP - 377 LA - English SN - 00242535 UR - https://search.proquest.com/docview/1080973857?accountid=27464 L4 - http://www.emeraldinsight.com/doi/full/10.1108/00242531211280487 KW - Library And Information Sciences KW - Archives & records KW - United Kingdom--UK KW - Libraries KW - Metadata KW - Publications ER - TY - JOUR TI - Archiving before Loosing Valuable Data? Development of Web Archiving in Europe AU - Lasfargues, France AU - Martin, Chloé AU - Medjkoune, Leïla T2 - Bibliothek Forschung und Praxis AB - Web content is, by nature, ephemeral: sites are updated regularly and disappear, which involves the loss of information of unique value. The importance of this medium grows continuously in our society, and institutions are developing websites with a variety of content, creating a large media-centric Web sphere. Like any medium, it is essential to preserve it as a key part of our heritage.
DA - 2012/01// PY - 2012 DO - 10.1515/bfp-2012-0014 VL - 36 IS - 1 SP - 117 EP - 124 LA - English SN - 1865-7648 UR - https://search.proquest.com/docview/1532083850?accountid=27464 L4 - https://www.degruyter.com/view/j/bfup.2012.36.issue-1/bfp-2012-0014/bfp-2012-0014.xml L4 - https://pdfs.semanticscholar.org/0a63/dc2723241009b0465cad4d736e6777591ae7.pdf KW - Web archiving KW - Library And Information Sciences KW - preservation KW - state of the art ER - TY - JOUR TI - Who and what links to the Internet Archive AU - AlNoamany, Yasmin AU - AlSum, Ahmed AU - Weigle, Michele C AU - Nelson, Michael L T2 - International Journal on Digital Libraries AB - Issue Title: 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013) The Internet Archive's (IA) Wayback Machine is the largest and oldest public Web archive and has become a significant repository of our recent history and cultural heritage. Despite its importance, there has been little research about how it is discovered and used. Based on Web access logs, we analyze what users are looking for, why they come to IA, where they come from, and how pages link to IA. We find that users request English pages the most, followed by the European languages. Most human users come to Web archives because they do not find the requested pages on the live Web. About 65 % of the requested archived pages no longer exist on the live Web. We find that more than 82 % of human sessions connect to the Wayback Machine via referrals from other Web sites, while only 15 % of robots have referrers. Most of the links (86 %) from Websites are to individual archived pages at specific points in time, and of those 83 % no longer exist on the live Web. 
Finally, we find that users who come from search engines browse more pages than users who come from external Web sites. [PUBLICATION ABSTRACT] DA - 2014/08/23/ PY - 2014 DO - 10.1007/s00799-014-0111-5 VL - 14 IS - 3-4 SP - 101 EP - 115 LA - English SN - 1432-5012 UR - https://search.proquest.com/docview/1547814935?accountid=27464 L4 - https://arxiv.org/pdf/1309.4016 L4 - http://link.springer.com/article/10.1007/s00799-014-0111-5 L4 - http://link.springer.com/10.1007/s00799-014-0111-5 KW - Information science KW - Digital libraries KW - Library And Information Sciences--Computer Applications KW - Archives & records KW - Internet KW - Data mining ER - TY - JOUR TI - Web Archiving at the Library of Congress AU - Grotke, Abbie T2 - Computers in Libraries AB - The selection of sites is not something that the LC automates; recommending officers (ROs) do this work. The goals of the consortium include collecting a rich body of internet content from around the world and fostering the development and use of common tools, techniques, and standards that enable the creation of international archives. IIPC members are currently engaged in a number of exciting projects: launching a worldwide education and training program that will feature technical and curatorial workshops and staff exchanges; planning an international collaborative collection project around the 2012 Summer Olympics; publishing information about the preservation of web archives in many institutional contexts; and establishing a technical program to fund exploratory projects and report about new techniques and tools to archive the fast-changing web.
DA - 2011/12// PY - 2011 VL - 31 IS - 10 SP - 15 EP - 19 LA - English SN - 10417915 UR - https://search.proquest.com/docview/911079001?accountid=27464 L4 - http://eric.ed.gov/?id=EJ963355 KW - Library And Information Sciences--Computer Applications KW - Archives & records KW - Internet resources KW - Web sites KW - Olympic games ER - TY - JOUR TI - Web Archive Search as Research: Methodological and Theoretical Implications AU - Ben-David, Anat AU - Huurdeman, Hugo T2 - Alexandria: The Journal of National and International Library and Information Issues AB - The field of web archiving is at a turning point. In the early years of web archiving, the single URL was the dominant unit for preservation and access. Access tools such as the Internet Archive's Wayback Machine reflect this notion as they allowed consultation, or browsing, of one URL at a time. In recent years, however, the single URL approach to accessing web archives is being gradually replaced by search interfaces. This paper addresses the theoretical and methodological implications of the transition to search on web archive research. It introduces 'search as research' methods, practices already applied in studies of the live web, which can be repurposed and implemented for critically studying archived web data. Such methods open up a variety of analytical practices that were so far precluded by the single URL entry point to the web archive, such as the re-assemblage of existing collections around a theme or an event, the study of archival artefacts and scaling the unit of analysis from the single URL to the full archive, by generating aggregate views and summaries. The paper introduces examples of 'search as research' scenarios, which have been developed by the WebART project at the University of Amsterdam and the Centrum Wiskunde & Informatica, in collaboration with the National Library of the Netherlands.
The paper concludes with a discussion of current and potential limitations of 'search as research' methods for studying web archives, and the ways in which they can be overcome in the near future. DA - 2014/08// PY - 2014 DO - 10.7227/ALX.0022 VL - 25 IS - 1-2 SP - 93 EP - 111 LA - English SN - 0955-7490 UR - https://search.proquest.com/docview/1623368254?accountid=27464 L4 - http://journals.sagepub.com/doi/10.7227/ALX.0022 L4 - http://journals.sagepub.com/doi/pdf/10.7227/ALX.0022 KW - Internet Archive KW - web archives KW - Wayback Machine KW - Library And Information Sciences KW - national libraries KW - search ER - TY - JOUR TI - Hard Content, Fab Front-End: Archiving Websites of Dutch Public Broadcasters AU - Baltussen, Lotte Belice AU - Blom, Jaap AU - Medjkoune, Leïla AU - Pop, Radu AU - Van Gorp, Jasmijn AU - Huurdeman, Hugo AU - Haaijer, Leidi T2 - Alexandria: The Journal of National and International Library and Information Issues AB - Although there are a great variety of web archiving projects around the world, there are not many that focus explicitly on websites of broadcasters. The reason is that funds are often lacking to do this, and that broadcaster websites are difficult to archive, due to their dynamic and audiovisual content. The Netherlands Institute for Sound and Vision, with its collection of over 800,000 hours of audiovisual content, has been involved in a small-scale research project related to web archiving since 2008. When Sound and Vision was approached by Dutch public broadcaster NTR to archive four of its websites, it was decided to start a collaborative pilot project that focused both on learning more about archiving broadcaster websites and developing a clean and modern public access interface. The main lesson learned from this pilot is that to archive highly dynamic and AV-heavy broadcaster websites it is vital to use supplementary capture tools and manual archiving of this ‘difficult’ content.
Furthermore, since the focus of web archiving projects is usually not on a good-looking front-end, the wheel had to be partly re-invented by involving various stakeholders and determining the most important requirements. The first version of the web archive was evaluated by various prospective target users. This evaluation revealed that the participants indeed appreciated the look and speed of the web archive, and that users needed to be made more aware of the web archive's purpose and limitations. The work will be continued and scaled up, by archiving more broadcaster websites, continuing the research on how best to capture and make accessible dynamic and AV content, and by creating standard practices for making the web archive publicly available. DA - 2014/08// PY - 2014 DO - 10.7227/ALX.0021 VL - 25 IS - 1-2 SP - 69 EP - 91 LA - English SN - 0955-7490 UR - https://search.proquest.com/docview/1623365171?accountid=27464 L4 - http://journals.sagepub.com/doi/10.7227/ALX.0021 L4 - http://journals.sagepub.com/doi/pdf/10.7227/ALX.0021 KW - web archives KW - Library And Information Sciences KW - audiovisual material KW - broadcasters' websites KW - user studies ER - TY - JOUR TI - Long-term preservation at the National Library of France (BnF): Scalable Preservation and Archiving Repository (SPAR) AU - Ledoux, Thomas T2 - International Preservation News AB - The National Library of France (BnF) has the mission to collect, preserve and give access to all the published material in France. To this aim, the legal deposit has been extended to the different forms of publishing from the printed material in 1537, to electronic documents in 1992, as well as the Internet in 2006. To preserve all this digital cultural heritage, the BnF has designed a Scalable Preservation and Archiving Repository (SPAR). This central repository has to handle the diversity (media, formats, departments) by taking inspiration from good practices and standards. 
The key requirements of the system were: 1. OAIS compliance, 2. modularity and scalability, 3. abstraction, 4. use of well-known formats and standards, 5. use of open-source technical building blocks. DA - 2012/08// PY - 2012 IS - 57 SP - 18 EP - 20 LA - English UR - https://search.proquest.com/docview/1124539611?accountid=27464 KW - Library And Information Sciences KW - Archives & records KW - Migration KW - Metadata KW - Information storage KW - Infrastructure KW - Product introduction ER - TY - JOUR TI - Counting the uncountable: statistics for web archives AU - Oury, Clement AU - Poll, Roswitha T2 - Performance Measurement and Metrics AB - Purpose - The purpose of this paper is to describe the aims and contents of the ISO Report ISO/TR 14873. Design/methodology/approach - For more than a decade, libraries have started to "collect the web". National libraries in particular select, collect and store publications and websites from their national domain, seeing this as a task similar to traditional legal deposit. The collection policies and collecting methods vary, so that it is difficult to compare the quantity and quality of the respective web archives. Findings - In order to harmonize the evaluation of web archives, ISO TC 46 SC 8 has produced a Technical Report that standardizes the terminology and statistics and offers tested indicators for assessing the quality of web archiving. Originality/value - This paper describes the shortly to be published ISO/TR 14873, a potentially vital guide to harmonize web archive collection internationally.
DA - 2013/// PY - 2013 DO - 10.1108/PMM-05-2013-0014 VL - 14 IS - 2 SP - 132 EP - 141 LA - English SN - 14678047 UR - https://search.proquest.com/docview/1399615625?accountid=27464 L4 - https://hal-bnf.archives-ouvertes.fr/hal-01098522/file/OuryPoll-Performance-2013-en.pdf L4 - http://www.emeraldinsight.com/doi/abs/10.1108/PMM-05-2013-0014 KW - Library And Information Sciences KW - Archives & records KW - Library collections KW - Internet resources KW - Web sites KW - Software KW - Quality standards KW - Statistics ER - TY - JOUR TI - National Libraries' Traditional Collection Policy Facing Web Archiving AU - Shveiky, Rivka AU - Bar-Ilan, Judit T2 - Alexandria AB - One of the main missions of a national library is to preserve the national creative works in printed and non-printed formats. In the 1990s, national libraries began to harvest and archive the national body of creative work that was published on the internet. The aim of the study was to examine to what extent national libraries implement their general collection policy when they establish a national web archive. The study, which was based on a qualitative approach, had three phases: examining the characteristics of a traditional collection policy of a national library; identifying the characteristics of a collection policy of a national library’s web archive; and comparing the traditional collection characteristics with the national library’s web archive characteristics. The results showed that although the libraries that were studied were from different regions of the world and various cultures, the characteristics of their traditional collections are similar. In contrast, the difference between their web archives is more significant. National libraries do not apply the traditional policy to the internet, and struggle to shape new rules for coping with web contents.
DA - 2013/// PY - 2013 VL - 24 IS - 3 SP - 37 EP - 72 LA - English SN - 0955-7490 UR - https://search.proquest.com/docview/1548796786?accountid=27464 L4 - http://journals.sagepub.com/doi/abs/10.7227/alx.0001 L4 - http://www.ingentaconnect.com/contentone/manup/alex/2013/00000024/00000003/art00004 KW - Library And Information Sciences ER - TY - JOUR TI - Lost but not forgotten: finding pages on the unarchived web AU - Huurdeman, Hugo C AU - Kamps, Jaap AU - Samar, Thaer AU - de Vries, Arjen P AU - Ben-David, Anat AU - Rogers, Richard A T2 - International Journal on Digital Libraries AB - Issue Title: Focused Issue on Digital Libraries 2014 Web archives attempt to preserve the fast changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host-level leads to significantly richer representations, but the distribution remains skewed. 
Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites. DA - 2015/09/03/ PY - 2015 DO - 10.1007/s00799-015-0153-3 VL - 16 IS - 3-4 SP - 247 EP - 265 LA - English SN - 1432-5012 UR - https://search.proquest.com/docview/1703890962?accountid=27464 L4 - http://link.springer.com/article/10.1007/s00799-015-0153-3 L4 - http://link.springer.com/10.1007/s00799-015-0153-3 KW - Web archiving KW - Web archives KW - Digital libraries KW - Library And Information Sciences--Computer Applications KW - World Wide Web KW - Digital archives KW - Information retrieval KW - Anchor text KW - Link evidence KW - Web crawlers ER - TY - JOUR TI - Collect, Preserve, Access: Applying the Governing Principles of the National Archives UK Government Web Archive to Social Media Content AU - Espley, Suzy AU - Carpentier, Florent AU - Pop, Radu AU - Medjkoune, Leïla T2 - Alexandria: The Journal of National and International Library and Information Issues AB - It is The National Archives' responsibility to collect and secure the future of the public record in all its forms and to make it as accessible as possible. The UK Government Web Archive (UKGWA) effectively preserves the open digital record. This article will explore the challenges encountered, and the Application Programming Interface (API) based solutions developed, by The National Archives and the Internet Memory Foundation (IMF) in the completion of a pilot project to capture the record as it is published on the social media services Twitter and YouTube. An outline of the wider web archiving programme and its role within the management of the government web estate is provided.
The legislative framework that guides web archiving at The National Archives is described as it has necessarily influenced the policy decisions that shaped the solutions developed. A brief overview of some comparative approaches taken by other organizations and commercial services to capturing Twitter content is also presented as context to the policy and technical solutions arrived at by the authors. The National Archives has sought to develop the building blocks of a collection whose growth can be sustained over time. The publication of this part of the archive will be followed by further evaluation and improvements to the initial approach taken. DA - 2014/08// PY - 2014 DO - 10.7227/ALX.0019 VL - 25 IS - 1-2 SP - 31 EP - 50 LA - English SN - 0955-7490 UR - https://search.proquest.com/docview/1623367977?accountid=27464 L4 - http://journals.sagepub.com/doi/10.7227/ALX.0019 L4 - http://journals.sagepub.com/doi/pdf/10.7227/ALX.0019 KW - web archives KW - social media KW - technology KW - Library And Information Sciences KW - government KW - public records ER - TY - JOUR TI - Quality Assurance Paradigms in Web Archiving Pre and Post Legal Deposit AU - Bingham, Nicola T2 - Alexandria: The Journal of National and International Library and Information Issues AB - This article discusses quality assurance paradigms in the pre and post legal deposit environments, exploring how workflows and processes have adapted from a small-scale, selective model to domain-scale harvesting activity. It draws comparisons between the two approaches and discusses the trade-offs necessitated by the change in scale of web harvesting activity. The requirements of the non-print legal deposit legislation of 2013 and the change in scale in web archiving operations have necessitated new quality metrics for the web archive collection. Whereas it was possible to manually review every instance of a harvested website, the new model requires that more automated methods are employed. 
The article looks at the tools employed in the selective web archiving model such as the Web Curator Tool and those designed for the legal deposit workflow such as the Annotation and Curation Tool. It examines the key technical issues in archiving websites and how content is prioritized for quality assurance. The article will be of interest to people employed in memory institutions including national libraries who are tasked with preserving online content as well as a wider general audience. DA - 2014/08// PY - 2014 DO - 10.7227/ALX.0020 VL - 25 IS - 1-2 SP - 51 EP - 68 LA - English SN - 0955-7490 UR - https://search.proquest.com/docview/1623368516?accountid=27464 L4 - http://journals.sagepub.com/doi/10.7227/ALX.0020 L4 - http://journals.sagepub.com/doi/pdf/10.7227/ALX.0020 KW - web archiving KW - Library And Information Sciences KW - websites KW - libraries KW - curation KW - quality assurance ER - TY - JOUR TI - Archiving the web using page changes patterns: a case study AU - Saad, Myriam Ben AU - Gançarski, Stéphane T2 - International Journal on Digital Libraries AB - Issue Title: Focused Issue on Joint Conference on Digital Libraries (JCDL) 2011 A pattern is a model or a template used to summarize and describe the behavior (or the trend) of data having generally some recurrent events. Patterns have received a considerable attention in recent years and were widely studied in the data mining field. Various pattern mining approaches have been proposed and used for different applications such as network monitoring, moving object tracking, financial or medical data analysis, scientific data processing, etc. In these different contexts, discovered patterns were useful to detect anomalies, to predict data behavior (or trend) or, more generally, to simplify data processing or to improve system performance. However, to the best of our knowledge, patterns have never been used in the context of Web archiving. 
Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools to efficiently archive websites. We first define our pattern model that describes the importance of page changes. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve Web archives. The archive of French public TV channels France Télévisions is chosen as a case study to validate our approach. Our experimental evaluation based on real Web pages shows the utility of patterns to improve archive quality and to optimize indexing or storing. [PUBLICATION ABSTRACT] DA - 2012/12/06/ PY - 2012 DO - 10.1007/s00799-012-0094-z VL - 13 IS - 1 SP - 33 EP - 49 LA - English SN - 1432-5012 UR - https://search.proquest.com/docview/1197168439?accountid=27464 L4 - http://dl.acm.org/citation.cfm?id=1998098 L4 - http://www-poleia.lip6.fr/~gancarsk/papers/BG_JCDL2011.pdf L4 - http://link.springer.com/10.1007/s00799-012-0094-z KW - Models KW - Library And Information Sciences--Computer Applica KW - World Wide Web KW - Archives & records KW - Data mining KW - Case studies ER - TY - JOUR TI - Case Studies in Web Sustainability AU - Turner, Scott T2 - Ariadne AB - At the moment organisations often make significant investments in producing Web-based material, often funded through public money, for example from JISC. We are seeing cuts in funding or changes in governmental policy, which is resulting in the closure of some of these organisations. What happens to those Web resources when the organisations are no longer in existence? Public money has often been used to develop these resources; from that perspective it would be a shame to lose them. Moreover, the resources might be needed, or someone may actually want to take over the maintenance of the site at a later date.
JISC previously funded three projects to look at this area through a programme called Sustaining at risk online resources [1]. One of these projects, which ran at The University of Northampton, looked into rescuing one of the recently closed East Midlands Universities Association's online resources. This resource, called East Midlands Knowledge Network (EMKN), lists many of the knowledge transfer activities of 10 of the East Midlands universities. The project looked at options on how to migrate the site to a free hosting option to make it more sustainable even when it is no longer available on the original host's servers. This article looks at this work as a case study on Web sustainability and also includes a case study of another project where Web sustainability was central. Adapted from the source document. DA - 2012/11// PY - 2012 IS - 70 LA - English SN - 1361-3200 UR - https://search.proquest.com/docview/1680141236?accountid=27464 L4 - https://www.ariadne.ac.uk/issue70/turner KW - Web archiving KW - Preservation KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Universities KW - Projects KW - Web hosting ER - TY - JOUR TI - Digital Humanities in the 21st Century: Digital Material as a Driving Force AU - Brügger, Niels T2 - Digital Humanities Quarterly AB - In this article it is argued that one of the major transformative factors of the humanities at the beginning of the 21st century is the shift from analogue to digital source material, and that this shift will affect the humanities in a variety of ways. But various kinds of digital material are not digital in the same way, which a distinction between digitized, born-digital, and reborn-digital may help us acknowledge, thereby helping us to understand how each of these types of digital material affects different phases of scholarly work in its own way. This is illustrated by a detailed comparison of the nature of digitized collections and web archives.
DA - 2016/// PY - 2016 VL - 10 IS - 3 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://www.digitalhumanities.org/dhq/vol/10/3/000256/000256.html KW - web archiving KW - web KW - web archive KW - born digital KW - digital humaniora KW - digital humanities KW - digital material KW - digitaliseret KW - digitalitet KW - digitality KW - digitalt materiale KW - digitised KW - født digitalt KW - genfødt digitalt KW - reborn digital KW - webarkiv KW - webarkivering ER - TY - JOUR TI - Methods of Web Philology: Computer Metadata and Web Archiving in the Primary Source Documents of Contemporary Esotericism AU - Plaisance, Christopher T2 - International Journal for the Study of New Religions AB - This article explores the issues surrounding the critical analysis of first generation electronic objects within the context of the study of contemporary esoteric discourse. This is achieved through a detailed case study of Benjamin Rowe's work, A Short Course in Scrying, which is solely exemplified by digital witnesses.
This article demonstrates that the critical analysis of these witnesses is only possible by adapting the general methods of textual scholarship to the specific techniques of digital forensics, particularly the analysis of computer metadata and web archives. The resulting method, here termed web philology, is applicable to the critical analysis by the scholar of religion of any primary source documents originating on the web as electronic objects. [ABSTRACT FROM AUTHOR] DA - 2016/05/31/ PY - 2016 DO - 10.1558/ijsnr.v7i1.26074 VL - 7 IS - 1 SP - 43 EP - 68 SN - 2041-9511 UR - http://10.0.6.22/ijsnr.v7i1.26074 L4 - https://www.equinoxpub.com/journals/index.php/IJSNR/article/view/26074 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archiving KW - methodology KW - contemporary esotericism KW - DIGITAL electronics KW - digital forensics KW - ESOTERICISM KW - METADATA KW - PHILOLOGY KW - textual scholarship KW - web philology ER - TY - CONF TI - MemGator - A Portable Concurrent Memento Aggregator AU - Alam, Sawood AU - Nelson, Michael L. AB - The Memento protocol makes it easy to build a uniform lookup service to aggregate the holdings of web archives. However, there is a lack of tools to utilize this capability in archiving applications and research projects. We created MemGator, an open source, easy to use, portable, concurrent, cross-platform, and self-documented Memento aggregator CLI and server tool written in Go. MemGator implements all the basic features of a Memento aggregator (e.g., TimeMap and TimeGate) and gives the ability to customize various options including which archives are aggregated. It is being used heavily by tools and services such as Mink, WAIL, OldWeb.today, and archiving research projects and has proved to be reliable even in conditions of extreme load.
C1 - New York, New York, USA C3 - Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries - JCDL '16 DA - 2016/// PY - 2016 DO - 10.1145/2910896.2925452 SP - 243 EP - 244 PB - ACM Press SN - 978-1-4503-4229-2 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://dl.acm.org/citation.cfm?doid=2910896.2925452 KW - Memento KW - Web Archiving KW - Computer science KW - Computing and Processing KW - Aggregates KW - Aggregator KW - Concurrent computing KW - MemGator KW - Protocols KW - Reliability KW - Servers KW - Stress ER - TY - JOUR TI - Preserving the internet AU - Shein, Esther T2 - Communications of the ACM AB - The article looks at efforts to preserve the contents of the Internet for future generations. Particular focus is given to the Global Database of Events, Language, and Tone (GDELT) project, led by computer scientist Kalev Leetaru, and to the not-for-profit digital library known as the Internet Archive. Topics include the alteration of online documents such as government press releases and the digitization of books and other museum and library collections. DA - 2015/12/21/ PY - 2015 DO - 10.1145/2843553 VL - 59 IS - 1 SP - 26 EP - 28 SN - 00010782 UR - http://10.0.4.121/2843553 L4 - http://dl.acm.org/citation.cfm?doid=2859829.2843553 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archiving KW - WEB archives KW - INTERNET Archive (Firm) KW - DIGITIZATION of library materials KW - LEETARU, Kalev ER - TY - JOUR TI - Preserving Seeds of Knowledge: A Web Archiving Case Study. AU - Heil, Jeremy M AU - Jin, Shan T2 - Information Management Journal AB - The article presents a case study of Queen's University in Ontario, Canada, and its project to preserve website content.
Topics include the procedures involved in archiving web content, the software tools used by archivists including a subscription to the Wayback Machine, and the officers that make up the project. DA - 2017/05// PY - 2017 VL - 51 IS - 3 SP - 20 EP - 22 SN - 15352897 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://search.proquest.com/docview/1923668954/fulltextPDF/E6801C31C031493DPQ/1?accountid=27464 KW - WEB archiving KW - WEB archives KW - INTERNET content KW - QUEEN'S University (Kingston, Ont.) KW - WEBSITE maintenance & repair ER - TY - JOUR TI - Historians and Web Archives. AU - Belovari, Susanne T2 - Archivaria AB - Since the 1990s, the Web has increasingly become the location where we carry out our activities and generate primary and secondary records. Increasingly, such records exist only on the Web, with no complementary or supplementary records available elsewhere. While web archives began to preserve this legacy in 1991, web history has not yet emerged as a fully developed field. One explanation may be historians' concerns that they will not be able to replicate their historical research process when using web archives, and may not find essential and authoritative records. The article's first section proposes a thought experiment in which a future historian in 2050 wants to research web history using web archives as they existed in 2015. She relies on the customary historical research process through which historians choose topics and search, browse, and contextualize sources in depth and iteratively. The experiment fails when our historian is unable to locate appropriate repositories and authoritative records without resorting to the live Web of 2015. The second section then analyzes 21 eminent web archives in 2015 and issues that may have an impact on historical research. Most web archives are apparently akin to libraries of information resources.
Archivists and historians, however, need web repositories to contain and make accessible essential web records of enduring cultural, historical, and evidentiary value. The article suggests that historians may once again prove invaluable in figuring out basic archival issues related to web records and archives, just as they helped shape archival policies a couple of centuries ago. (English) [ABSTRACT FROM AUTHOR] DA - 2017/// PY - 2017 IS - 83 SP - 59 EP - 79 SN - 03186954 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://archivaria.ca/index.php/archivaria/article/view/13600/14985 KW - Web archiving KW - Web archives KW - Archives KW - Digital libraries KW - Historians ER - TY - THES TI - Performance Measurement and Analysis of Transactional Web Archiving AU - Maharshi, Shivam AB - Web archiving is necessary to retain the history of the World Wide Web and to study its evolution. It is important for the cultural heritage community. Some organizations are legally obligated to capture and archive Web content. The advent of transactional Web archiving makes the archiving process more efficient, thereby aiding organizations to archive their Web content. This study measures and analyzes the performance of transactional Web archiving systems. To conduct a detailed analysis, we construct a meaningful design space defined by the system specifications that determine the performance of these systems. SiteStory, a state-of-the-art transactional Web archiving system, and local archiving, an alternative archiving technique, are used in this research. We experimentally evaluate the performance of these systems using the Greek version of Wikipedia deployed on dedicated hardware on a private network. Our benchmarking results show that the local archiving technique uses a Web server’s resources more efficiently than SiteStory for one data point in our design space. 
Better performance than SiteStory in such scenarios makes our archiving solution favorable to use for transactional archiving. We also show that SiteStory does not impose any significant performance overhead on the Web server for the rest of the data points in our design space. DA - 2017/// PY - 2017 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://hdl.handle.net/10919/78371 L4 - https://vtechworks.lib.vt.edu/bitstream/handle/10919/78371/Maharshi_S_T_2017.pdf?sequence=1 KW - Digital Preservation KW - Web Archiving KW - Performance Benchmark ER - TY - CONF TI - Hungarian web archiving pilot project in the National Széchényi Library AU - Nemeth, Marton AU - Drotos, Laszlo AB - This demo paper introduces the web archiving pilot project in the Hungarian National Széchényi Library. Basic conception and goals are being described. C3 - 2017 8th IEEE International Conference on Cognitive Infocommunications (CogInfoCom) DA - 2017/09// PY - 2017 DO - 10.1109/CogInfoCom.2017.8268244 SP - 000209 EP - 000212 PB - IEEE SN - 978-1-5386-1264-4 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://ieeexplore.ieee.org/document/8268244/ KW - Collaboration KW - web archiving KW - Communication KW - Conferences KW - General Topics for Engineers KW - Networking and Broadcast Technologies KW - Robotics and Control Systems KW - Internet KW - Web sites KW - Libraries KW - National Széchényi Library KW - pilot project KW - Software KW - Terrorism ER - TY - JOUR TI - Mining the information architecture of the WWW using automated website boundary detection. AU - Alshukri, Ayesh AU - Coenen, Frans T2 - Web Intelligence (2405-6456) AB - The world wide web has two main forms of architecture, the first is that which is explicitly encoded into web pages, and the second is that which is implied by the web content, particularly pertaining to look and feel. 
The latter is exemplified by the concept of a website, a concept that is only loosely defined, although users intuitively understand it. The Website Boundary Detection (WBD) problem is concerned with the task of identifying the complete collection of web pages/resources that are contained within a single website. In any case, the concept of a website is used with respect to a number of application domains including website archiving, spam detection, and www analysis. In the context of such applications it is beneficial if a website can be automatically identified. This is usually done by identifying a website of interest in terms of its boundary, the so-called WBD problem. In this paper seven WBD techniques are proposed and compared: four statistical techniques where the web data to be used is obtained a priori, and three dynamic techniques where the data to be used is obtained as the process progresses. All seven techniques are presented in detail and evaluated. [ABSTRACT FROM AUTHOR] DA - 2017/10// PY - 2017 VL - 15 IS - 4 SP - 269 EP - 290 SN - 24056456 UR - http://10.0.12.161/WEB-170365 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - INTERNET KW - digital preservation KW - WEBSITES KW - SPAM (Email) KW - COMPUTER network resources KW - INTERNET content KW - random walk techniques KW - web graphs KW - web page clustering KW - Web structure mining KW - website boundary detection ER - TY - JOUR TI - Web Arkiver ; Web archives AU - Finnemann, Niels Ole AB - This article deals with general web archives and the principles for selection of materials to be preserved. It opens with a brief overview of reasons why general web archives are needed. Sections two and three present major, long-term web archive initiatives, discuss the purposes and possible values of web archives, and ask how to meet unknown future needs, demands and concerns.
Section four analyses three main principles in contemporary web archiving strategies (topic-centric, domain-centric, and time-centric archiving) and section five discusses how to combine these to provide a broad and rich archive. Section six is concerned with inherent limitations and why web archives are always flawed. The last sections deal with the question of how web archives may fit into the rapidly expanding but fragmented landscape of digital repositories taking care of various parts of the exponentially growing amounts of ever more heterogeneous data materials.
DA - 2018/// PY - 2018 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://curis.ku.dk/portal/files/201560994/Web_archive_ISKO_ENCYC2018.pdf KW - Web archiving KW - /dk/atira/pure/core/keywords/FacultyOfHumanities KW - domain centric archiving KW - Faculty of Humanities KW - Selection criteria KW - time centric archiving KW - topic centric archiving KW - web materials ER - TY - JOUR TI - Web-Archiving Chinese Social Media: Final Project Report August 2017. AU - Ye, Yunshan AU - Ye, Ding AU - Zeljak, Cathy AU - Kerchner, Daniel AU - He, Yan AU - Littman, Justin T2 - Journal of East Asian Libraries DA - 2017/10// PY - 2017 IS - 165 SP - 93 EP - 112 SN - 10875093 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=2701&context=jeal KW - Web archiving KW - Academic librarians -- China KW - Activism KW - Online social networks KW - Xi, Jinping, 1953- ER - TY - THES TI - Intelligent Event Focused Crawling AU - Farag, Mohamed Magdy Gharib AB - There is a need for an integrated event focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about that event. Yet, there is little systematic collecting and archiving anywhere of information about events. We propose intelligent event focused crawling for automatic event tracking and archiving, as well as effective access. We extend the traditional focused (topical) crawling techniques in two directions, modeling and representing events and webpage source importance. We developed an event model that can capture key event information (topical, spatial, and temporal). We incorporated that model into the focused crawler algorithm.
For the focused crawler to leverage the event model in predicting a webpage's relevance, we developed a function that measures the similarity between two event representations, based on textual content. Although the textual content provides a rich set of features, we proposed an additional source of evidence that allows the focused crawler to better estimate the importance of a webpage by considering its website. We estimated webpage source importance by the ratio of relevant to non-relevant webpages found during crawling of a website. We combined the textual content information and source importance into a single relevance score. For the focused crawler to work well, it needs a diverse set of high quality seed URLs (URLs of relevant webpages that link to other relevant webpages). Although manual curation of seed URLs guarantees quality, it requires exhaustive manual labor. We proposed an automated approach for curating seed URLs using social media content. We leveraged the richness of social media content about events to extract URLs that can be used as seed URLs for further focused crawling. We evaluated our system through four series of experiments, using recent events: Orlando shooting, Ecuador earthquake, Panama papers, California shooting, Brussels attack, Paris attack, and Oregon shooting. In the first experiment series our proposed event model representation, used to predict webpage relevance, outperformed the topic-only approach, showing better results in precision, recall, and F1-score. In the second series, using harvest ratio to measure ability to collect relevant webpages, our event model-based focused crawler outperformed the state-of-the-art focused crawler (best-first search). The third series evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages.
The focused crawler with webpage source importance managed to collect roughly the same number of relevant webpages as the focused crawler without webpage source importance, but from a smaller set of sources. The fourth series provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection. DA - 2016/// PY - 2016 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://hdl.handle.net/10919/73035 KW - Web Archiving KW - Digital Libraries KW - Event Modeling KW - Focused Crawling KW - Seed URLs Selection KW - Social Media Mining KW - Web Mining ER - TY - JOUR TI - Digitálne pramene - národný projekt zberu a archivácie v roku 1. AU - Androvič, Ing. Alojz AU - Bizík, Bc. Andrej AU - Hausleitner, Ing. Peter AU - Katrincová, PhDr. Beáta AU - Lacková, Mgr. Iveta AU - Matúšková, PhDr. Jana T2 - Knihovna PLUS AB - In 2015 the University Library in Bratislava put into practice the national project Digital Resources -- Webharvesting and E-Born Content Archiving. The project ran within the framework of the Operational Program Informatisation of Society. Its ambition was to establish a technical, application and management infrastructure for systematic harvesting and long-term preservation of web pages and e-Born resources. The implementation is based on open source software modules (Heritrix, OpenWayback, Invenio). The systems management is optimized for parallel webharvesting. This article presents the experiences and results of the operation of IS Digital Resources in 2016. It describes the workflow of webharvesting and acquisition of e-Born resources and discusses some methodological and practical problems in dealing with e-Born serials. The article gives an analytical and statistical overview of harvests realised in 2016, with special emphasis on the complex harvest of the national .sk domain.
(English) [ABSTRACT FROM AUTHOR] DA - 2017/01// PY - 2017 IS - 1 SP - 1 EP - 14 SN - 18015948 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - WARC KW - archivácia webu KW - digital curation KW - digitálne kurátorstvo KW - e-Born pramene KW - e-Born resources KW - ISSN KW - web analytics KW - webharvesting KW - webová analytika KW - zber webu ER - TY - RPRT TI - Using a Web-Archiving Service - How to ensure your cited web-references remain available and valid AU - Levy, David Claude AB - In today’s electronic information age, academic authors increasingly cite online resources such as blog posts, news articles, online policies and reports in their scholarly publications. Citing such webpages, or their URLs, poses long-term accessibility concern due to the ephemeral nature of the Internet: webpages can (and do!) change or disappear1 over time. When looking up cited web references, readers of scholarly publications might thus find content that is different from what author/s originally referenced; this is referred to as ‘content drift’. Other times, readers are faced with a ‘404 Page Not Found’ message, a phenomenon known as ‘link rot’2. A recent Canadian study3 for example found a 23% link rot when examining 11,437 links in 664 doctoral dissertations from 2011-2015. Older publications are likely to face even higher rates of invalid links. Luckily, there are a few things you can do to make your cited web references more stable. The most common method is to use a web archiving service. Using a web archiving service means your web references and links are more likely to connect the reader to the content accessed at the time of writing/citing. In other words, references are less likely to “rot” or “drift” over time. As citing authors, we have limited influence on preserving web content that we don’t own. 
We are generally at the mercy of the information custodians who tend to adjust, move or delete their web content to keep their site(s) current and interesting. All we can do to keep web content that we don’t own but want to cite intact, so that our readers can still access it in years to come, is to create a “representative memento” of the online material as it was at the time of citing. This can be achieved by submitting the URL of the webpage we want to cite to a web archiving service which will generate a static (‘cached’) copy of it and allocate it a new, unique and permanent link, also called a ‘persistent identifier’. We can then use this new link to the archived webpage rather than the ephemeral link to the original webpage for our citation purposes. There is a range of web archives available. This guide contains a list of trusted web archiving services. CY - Australia, Australia/Oceania DA - 2017/// PY - 2017 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://hdl.handle.net/2123/17210 KW - web archiving KW - content drift KW - link rot KW - referencing online links KW - web referencing ER - TY - CONF TI - Impact of URI Canonicalization on Memento Count AU - Kelly, Matt AU - Alkwai, Lulwah M. AU - Alam, Sawood AU - Van de Sompel, Herbert AU - Nelson, Michael L. AU - Weigle, Michele C. AB - Quantifying the captures of a URI over time is useful for researchers to identify the extent to which a Web page has been archived. Memento TimeMaps provide a format to list mementos (URI-Ms) for captures along with brief metadata, like Memento-Datetime, for each URI-M. However, when some URI-Ms are dereferenced, they simply provide a redirect to a different URI-M (instead of a unique representation at the datetime), often also present in the TimeMap.
This implies that confidently obtaining an accurate count of the non-forwarding captures for a URI-R is not possible using a TimeMap alone and that the magnitude of a TimeMap is not equivalent to the number of representations it identifies. In this work we discuss this phenomenon in depth. We also perform a breakdown of the dynamics of counting mementos for a particular URI-R (google.com) and quantify the prevalence of the various canonicalization patterns that exacerbate attempts at counting using only a TimeMap. For google.com we found that 84.9% of the URI-Ms result in an HTTP redirect when dereferenced. We expand on and apply this metric to TimeMaps for seven other URI-Rs of large Web sites and thirteen academic institutions. Using a ratio metric DI for the number of URI-Ms without redirects to those requiring a redirect when dereferenced, five of the eight large web sites' and two of the thirteen academic institutions' TimeMaps had a ratio of less than one, indicating that more than half of the URI-Ms in these TimeMaps result in redirects when dereferenced. C1 - United States, North America C3 - 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL) DA - 2017/06// PY - 2017 DO - 10.1109/JCDL.2017.7991601 SP - 1 EP - 2 PB - IEEE SN - 978-1-5386-3861-3 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://ieeexplore.ieee.org/document/7991601/ KW - Web archiving KW - Memento KW - Computer Sciences KW - HTTP KW - Canonicalization patterns KW - Data patterns KW - URI KW - URI-M KW - Web Archive KW - Redirection ER - TY - JOUR TI - Assembling the Living Archive: A Media-Archaeological Excavation of Occupy Wall Street AU - Buel, Jason W T2 - Public Culture AB - The article discusses the issues behind the social protest called Occupy Wall Street (OWS) that was staged in Zuccotti Park, Manhattan, New York in September 2011.
Also cited are the efforts to archive the movement to preserve its history in a decentralized online archive, as well as the efforts by the OWS Archives Working Group in the archival process. DA - 2018/05/01/ PY - 2018 DO - 10.1215/08992363-4310930 VL - 30 IS - 2 SP - 283 EP - 303 SN - 0899-2363 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://read.dukeupress.edu/public-culture/article/30/2/283/133936/Assembling-the-Living-Archive-A KW - DIGITAL preservation KW - WEB archiving KW - OCCUPY protest movement KW - OCCUPY Wall Street protest movement KW - SOCIAL movements ER - TY - JOUR TI - Developing Web Archiving Metadata Best Practices to Meet User Needs AU - Dooley, Jackie T2 - Journal of Western Archives AB - The OCLC Research Library Partnership Web Archiving Metadata Working Group was established to meet a widely recognized need for best practices for descriptive metadata for archived websites. The Working Group recognizes that development of successful best practices intended to ensure discoverability requires an understanding of user needs and behavior. We have therefore conducted an extensive literature review to build our knowledge and will issue a white paper summarizing what we have learned. We are also studying existing and emerging approaches to descriptive metadata in this realm and will publish a second report recommending best practices. We will seek broad community input prior to publication. DA - 2017/// PY - 2017 VL - 8 IS - 2 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=1079&context=westernarchives KW - web archiving KW - Metadata KW - best practices KW - Cataloging of archival materials KW - descriptive metadata ER - TY - JOUR TI - Archive-It.
AU - Leach-Murray, Susan T2 - Technical Services Quarterly AB - The article reviews the website "Archive-It" located at https://archive-it.org, which is a subscription web archiving service that collects and assesses cultural heritage on the Internet. DA - 2018/04// PY - 2018 VL - 35 IS - 2 SP - 214 SN - 07317131 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archiving KW - CULTURAL property -- Computer network resources KW - WEBSITE reviews ER - TY - JOUR TI - How can we improve our web collection? An evaluation of webarchiving at the KB National Library of the Netherlands (2007-2017) AU - Sierman, Barbara AU - Teszelszky, Kees T2 - Alexandria AB - The Koninklijke Bibliotheek, the Dutch National Library (KB-NL), started the “web archiving” project in 2007, based on a selection of Dutch websites. The initial selection of 1,000 websites has currently grown into over 12,000 selected websites, crawled at different intervals. Although, due to legal restrictions, current use is limited to the KB-NL reading room, it is important that the KB-NL includes the requirements of (future) users in its approach to creating a web collection. With respect to the long-term preservation of the collection, we also need to incorporate the requirements for long-term archiving in our approach, as described in the Open Archival Information System (OAIS) reference model. This article describes the results of a research project on web archiving and the web collection of archived sites in the KB-NL, investigating the following questions. What is web archiving in the Netherlands? What are the selection criteria of KB-NL, and how are these related to what can be found on the Dutch web by the contemporary user? What is the influence of the harvesting tools we choose on the final archived website? Do we know enough about the value of the web collection and its potential usage by researchers, and how can we improve this value?
This article will describe the outcomes of the research and the conclusions and advice that can be drawn from it, and will hopefully inspire broader discussions about the essence of creating web collections for long-term preservation as part of cultural heritage. DA - 2017/// PY - 2017 DO - 10.1177/0955749017725930 VL - 27 IS - 2 SP - 94 EP - 106 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://journals.sagepub.com/doi/abs/10.1177/0955749017725930?journalCode=alaa KW - web archiving KW - digital preservation KW - KB National Library of the Netherlands KW - OAIS ER - TY - BOOK TI - How Perceptions of Web Resource Boundaries Differ for Institutional and Personal Archives AU - Poursardar, Faryaneh AB - What is and is not part of a web resource does not have a simple answer. Exploration of web resource boundaries has shown that people's assessments of resource bounds rely on understanding relationships between content fragments on the same web page and between content fragments on different web pages. This study explores whether such perceptions change based on whether the archive is for personal use or is institutional in nature. This survey explores user expectations when accessing archived web resources. Participants in the study were asked to assume they are making use of an archive provided by an institution tasked with preserving online resources, such as a digital archive that is part of the Library of Congress. Groups of paired web pages were presented to the participants. Each group has a primary web page that is the resource being saved by the institutional archive. Each group also has several subsequent parts or pages, which participants were asked about. Consistent with our previous study on personal archiving, the primary-page content in the study comes from multi-page stories, multi-image collections, product pages with reviews and ratings on separate pages, and short single-page writings.
Participants were asked to assume the institutional archive wants to preserve the primary page and then answer what else they would expect to be saved along with the primary page. The results show that while there are similar expectations for preserving continuations of the main content in personal and institutional archiving scenarios, institutional archives are more likely to be expected to preserve the context of the main content, such as additional linked content, advertisements, and author information. DA - 2018/// PY - 2018 SP - 126 PB - IEEE SN - 978-1-5386-2659-7 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - digital preservation KW - Communication KW - Computer science KW - Computing and Processing KW - Conferences KW - Data science KW - General Topics for Engineers KW - Image sequences KW - Institutional archiving KW - Networking and Broadcast Technologies KW - personal archiving KW - Robotics and Control Systems KW - Signal Processing and Analysis KW - Task analysis KW - Uniform resource locators KW - user study KW - Web pages ER - TY - JOUR TI - Usos do Arquivamento da Web na Comunicação Científica. AU - Ferreira, Lisiane Braga AU - Martins, Marina Rodrigues AU - Rockembach, Moisés T2 - Uses of Web Archiving in Scientific Communication. AB - This research analyzes the web environment and the information produced in this medium, aiming to configure web archiving as an object of study and a source of research data, alongside scientific communication as a practice of disseminating knowledge produced in universities. The methodology was delimited as exploratory research, based on an international bibliographic review of the subject and an analysis of the initiatives of the International Internet Preservation Consortium (IIPC) related to universities. It uses qualitative analysis of the objectives and projects developed by these initiatives.
It concludes that web archiving is a field that remains underexplored, particularly within universities, and observes a lack of research in the Latin American context, especially in Brazil. (English) [ABSTRACT FROM AUTHOR] DA - 2018/01// PY - 2018 IS - 36 SP - 78 EP - 98 SN - 16463153 UR - http://10.0.84.243/16463153/36a5 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Arquivamento da web KW - Ciência da Informação KW - Comunicação Científica KW - Information Science KW - Scientific Communication ER - TY - JOUR TI - Archive-It 2: Internet Archive Strives to Ensure Preservation and Accessibility AU - Mcclure, Marji T2 - EContent AB - Preserving seemingly ephemeral Web content is a daunting task. The problem is even more difficult because the content of Web pages changes and the pages themselves come and go with great frequency, which means simply collecting URLs is not enough to keep tabs on valuable content. To help make digital content preservation possible, Internet Archive, a San Francisco-based nonprofit, has led a charge to effectively capture and store Web content. The project recently released Archive-It 2 in its continued effort to archive the Web. Version 2 of Archive-It offers several new features not available in Version 1. Subscribers can now conduct test crawls, which enable them to see the type of Web material that would populate a specific collection before it is archived permanently. There is also a metadata search capability, which allows metadata to be included in the text searches of materials in a collection.
DA - 2006/10// PY - 2006 VL - 29 IS - 8 SP - 14 EP - 15 LA - English SN - 15252531 UR - https://search.proquest.com/docview/213815870?accountid=27464 KW - Digital libraries KW - Archives & records KW - United States--US KW - 7500:Product planning & development KW - 8331:Internet services industry KW - 9190:United States KW - Business And Economics--Management KW - Service introduction KW - Web content delivery ER - TY - JOUR TI - Políticas e tecnologias de preservação digital no arquivamento da web ; Policies and technologies to digital preservation in web archiving ; Política y tecnologías de preservación digital en el archivo de la web AU - Rockembach, Moisés AB - O objetivo do artigo foi analisar a preservação digital a partir da abordagem de arquivamento da web, desde as tecnologias envolvidas no processo de arquivamento, bem como políticas de seleção, preservação e disponibilização destes conteúdos, além do estudo de instituições internacionais que atuam na preservação da web. A metodologia utiliza pesquisa bibliográfica e documental sobre iniciativas internacionais de arquivamento da web e objetiva fomentar a discussão no Brasil, assim como servir de subsídio para estudos aplicados. Analisa as publicações científicas na base de periódicos Scopus dos últimos cinco anos (2012-2016) que versam sobre o arquivamento da web, políticas de seleção dos conteúdos web e tecnologias aplicadas à coleta, armazenamento e acesso aos websites arquivados. Traz também um panorama das tecnologias utilizadas pela comunidade de iniciativas de arquivamento da web, a partir da identificação dos dados disponibilizados no site do Consórcio Internacional de Preservação da Internet. 
Conclui que países que ainda não possuem iniciativas próprias, como o Brasil, com o estabelecimento de políticas de seleção com enfoques específicos (institucionais, temáticas, por domínio, etc.), assim como uma gestão do ciclo de vida do arquivamento da web e a adoção de tecnologias no formato código aberto (open source) podem não só preservar sua memória digital, mas também contribuir com a comunidade internacional de arquivamento da web. ; The objective of this paper was to analyze digital preservation from the web archiving approach, addressing the technologies involved in the archiving process and the policies for the selection, preservation and availability of these contents, as well as studying international institutions that work on the preservation of the web. The methodology uses bibliographic and documentary research on international web archiving initiatives and aims to foster discussion in Brazil, as well as to serve as a basis for applied studies. It analyzes scientific publications in the Scopus database from the last five years (2012-2016) that deal with web archiving, web content selection policies, and technologies applied to harvesting, storing and accessing archived websites. It also provides an overview of the technologies used by the community of web archiving initiatives, based on the data available on the website of the International Internet Preservation Consortium. It concludes that countries that do not yet have their own initiatives, such as Brazil, can, through the establishment of selection policies with specific approaches (institutional, thematic, by domain, etc.), management of the web archiving life cycle, and the adoption of open-source technologies, not only preserve their digital memory but also contribute to the international web archiving community.
; El objetivo del artículo fue analizar la preservación digital a partir del abordaje del archivamiento de la web, desde las tecnologías involucradas en el proceso de archivo, así como políticas de selección, preservación y puesta a disposición de estos contenidos, además del estudio de instituciones internacionales que actúan en la preservación de la información de la web. La metodología utilizada fue la investigación bibliográfica y documental sobre iniciativas internacionales de archivado de la web, y busca fomentar la discusión en Brasil, así como servir de subsidio para estudios aplicados. Analiza las publicaciones científicas de la base de datos Scopus en los últimos cinco años (2012-2016) que versan sobre el archivamiento de la web, políticas de selección de los contenidos de la web y tecnologías aplicadas a la recolección, almacenamiento y acceso a los sitios web archivados. También trae un panorama de las tecnologías utilizadas por la comunidad que participa de las iniciativas de archivamiento de la web, a partir de la identificación de los datos disponibles en el sitio del Consorcio Internacional de Preservación de Internet. Concluye que países que aún no tienen iniciativas propias, como Brasil, con el establecimiento de políticas de selección con enfoques específicos (institucionales, temáticos, por dominio, etc.), así como una gestión del ciclo de vida del archivo de la web y la adopción de tecnologías en el formato de código abierto (open source) pueden no sólo preservar su memoria digital, sino también contribuir con la comunidad internacional de archivamiento de la web.
DA - 2018/// PY - 2018 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Digital preservation KW - Archivamiento de la web KW - Arquivamento da Web KW - Política de preservação KW - Políticas de preservación KW - Preservação digital KW - Preservación digital KW - Preservation policy ER - TY - GEN TI - To Relive the Web: A Framework for the Transformation and Archival Replay of Web Pages AU - Berlin, John Andrew AB - When replaying an archived web page (known as a memento), the fundamental expectation is that the page should be viewable and function exactly as it did at archival time. However, this expectation requires web archives to modify the page and its embedded resources, so that they no longer reference (link to) the original server(s) they were archived from but instead reference the archive. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. Unfortunately, because the replay of mementos and the modifications made to them by web archives in order to facilitate replay vary between archives, consistent terminology for describing replay and the modifications made to mementos to facilitate replay does not exist. In this thesis, we propose terminology for describing the existing styles of replay and the modifications made on the part of web archives to mementos in order to facilitate replay. This thesis also, in the process of defining terminology for the modifications made by client-side rewriting libraries to the JavaScript execution environment of the browser during replay, proposes a general framework for the auto-generation of client-side rewriting libraries.
Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. By using the generated client-side rewriter, we were able to decrease the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increase the cumulative number of requests made by 32.8%. Also, by using the generated client-side rewriter, we were able to replay mementos that were previously not replayable from the Internet Archive. DA - 2018/// PY - 2018 PB - ODU Digital Commons UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Memento KW - Client-side rewriting KW - Computer Sciences KW - Digital Communications and Networking KW - High-fidelity replay KW - JavaScript KW - Web archive replay ER - TY - JOUR TI - Archiving Web Content: An Online Searcher Roundtable AU - Careless, James T2 - Online Searcher AB - In a roundtable discussion, several executives shared their views about archiving web content. Library of Congress' Office of Strategic Initiatives leader Abbie Grotke said the Library's web archiving project preserves web content around events, such as the US National Elections or September 11, or related themes such as public policy topics or the US Congress. They also archive their own Web site at loc.gov. Las Vegas-Clark County Library District virtual library manager Lauren Stokes said they archive their video and audio content in a variety of media. They use local server storage, portable hard drive backups as well as CD backups. Server storage is also backed up on tapes rotated into cold storage. Boston Public Library's director of administration and technology David Leonard said their digitization efforts are focused on accessibility.
Web portal accessibility -- whether as part of their own web presence or through the posting of materials to other Internet sites as well as some social media sites -- helps with accessibility. Adapted from the source document. DA - 2013/03// PY - 2013 VL - 37 IS - 2 SP - 44 EP - 46 LA - English SN - 2324-9684, 2324-9684 UR - https://search.proquest.com/docview/1417518328?accountid=27464 KW - Digital preservation KW - Web sites KW - Libraries KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Methods KW - Storage ER - TY - GEN TI - Descriptive metadata for web archiving: Review of harvesting tools AU - OCLC AB - The OCLC Research Library Partnership Web Archiving Working Group Tools Subgroup's objective analysis of 11 tools designed to extract descriptive metadata from harvested web content. Selected tools included those that harvest or replay web content, are actively under development and/or actively supported, and appeared to include descriptive metadata capture features. Tools reviewed include: Archive-It, Heritrix, HTTrack, Memento, Netarchive Suite, SiteStory, Social Feed Manager, Wayback Machine, Web Archive. DA - 2018/// PY - 2018 PB - OCLC Research UR - https://www.oclc.org/research/publications/2018/oclcresearch-descriptive-metadata/recommendations.html Y2 - 2020/08/14/ KW - Web archiving KW - Archives KW - Electronic information resources--Management KW - Application software--Reviews ER - TY - BOOK TI - Getting to Know Our Web Archive: A Pilot Project to Collaboratively Increase Access to Digital Cultural Heritage Materials in Wyoming AU - Lehman, Amanda R AB - The University of Wyoming is the only four-year higher education institution in the state, a unique position amongst colleges and universities in the United States.
Given this unusual status, it is especially important that the university libraries use their resources to identify and partner with communities around the state to build collections that preserve their cultural heritage. An Archive-It subscription was purchased in 2016, with an initial goal of capturing university-related materials. In an effort to expand the scope and meaningfulness of the web archive, a project has been undertaken to use university and statewide relationships to build a Wyoming-focused Native American digital cultural heritage collection comprised of web-based materials. This is an interdepartmental effort led by the Digital Collections Librarian and the Metadata Librarian that includes collaboration within the library, the university, and the state. CY - United States, North America DA - 2018/// PY - 2018 PB - Digital USD UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - Archive-It KW - collaboration KW - metadata KW - access KW - Cataloging and Metadata KW - Collection Development and Management KW - Digital Humanities KW - outreach KW - WorldCat ER - TY - GEN TI - A Grounded Theory of Information Quality for Web Archives AU - Reyes Ayala, Brenda AB - Presentation for the dissertation defense of Brenda Reyes Ayala. This presentation builds a theory of information quality for web archives that is grounded in human-centered data. DA - 2018/// PY - 2018 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - grounded theory KW - information quality ER - TY - JOUR TI - Hypertext and "Twitterature". AU - Lollini, Massimo T2 - Profession AB - The article offers information on the Oregon Petrarch Open Book (OPOB), a database-driven hypertext version of the poetry collection "Rerum vulgarium fragmenta" (Rvf) by Francesco Petrarca.
Topics discussed include the use of web archive and hypertext features for the creation of the database; the use of technology in teaching Petrarchism; and the archive of separate editions of the Rvf in the database. DA - 2018/03/22/ PY - 2018 SP - 1 SN - 07406959 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://profession.mla.hcommons.org/2018/03/22/hypertext-and-twitterature/ KW - WEB archiving KW - DIGITAL libraries KW - 1304-1374 KW - EDITIONS KW - Francesco KW - HYPERTEXT systems KW - PETRARCA KW - PETRARCHISM KW - POETRY collections ER - TY - JOUR TI - The Website 'Archiwum Internetu' Against a Background of Problems with Archiving Web Resources TT - Serwis "Archiwum Internetu" na tle ogolnych problemow archiwizacji zasobow sieciowych AU - Klebczyk, Filip T2 - Biuletyn EBIB AB - The article is an analysis of the possibilities and actions undertaken in the field of archiving Polish web resources and making them accessible. From the portal's point of view, not only financial and technical barriers but also legal regulations are important. To a significant extent, the law puts constraints on this activity, especially when it comes to making the resources accessible. The article presents an overview of international experience with web archiving and the techniques commonly used in archiving and providing open access to such resources. The Polish project 'Archiwum Internetu' is also discussed. 'Archiwum Internetu' is created by the National Digital Archives -- its legal foundation, present state and the development direction for this and similar projects are also included in the article. Adapted from the source document.
DA - 2012/// PY - 2012 IS - 1 LA - Polish SN - 1507-7187, 1507-7187 UR - https://search.proquest.com/docview/1266143222?accountid=27464 L4 - http://www.ebib.info/biuletyn/ KW - Web archiving KW - Digital preservation KW - Poland KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Barriers KW - Internet Archive, Polish Internet resources ER - TY - JOUR TI - Archiving Websites in the Nordic Countries TT - Archiwizowanie stron internetowych w krajach nordyckich AU - Nalewajska, Lilianna T2 - Biuletyn EBIB AB - The Nordic countries (Norway, Sweden, Finland, Denmark and Iceland) are the pioneers of web archiving. The process of collecting materials from the web, which requires arrangements concerning technical, legal and organizational issues, was started in these countries in the late 1990s or at the beginning of the 21st century. Archiving is being carried out mainly in national libraries, which also cooperate with the International Internet Preservation Consortium and co-create the Nordic Web Archive. The way of functioning and the difficulties which occur during archiving in the Nordic countries show the complexity of the process and point out how important long-term planning is. Adapted from the source document.
DA - 2012/// PY - 2012 IS - 1 LA - Polish SN - 1507-7187, 1507-7187 UR - https://search.proquest.com/docview/1266143226?accountid=27464 L4 - http://www.ebib.info/biuletyn/ KW - Web archiving KW - National libraries KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Cooperation KW - Internet archiving, Web archiving, Web archive KW - Nordic countries ER - TY - JOUR TI - Web Archiving -- the Situation in Polish Law from the Point of View of the Librarians TT - Archiwizacja Internetu -- sytuacja w polskim prawie z punktu widzenia bibliotekarzy AU - Slaska, Katarzyna AU - Wasilewska, Anna T2 - Biuletyn EBIB AB - The authors present the possibilities of archiving Polish web sites from the point of view of the librarians, focusing mainly on the legal aspects of this issue. The reflections are based on the Act on Legal Deposit Copies [Ustawa o obowiazkowych egzemplarzach bibliotecznych] currently in force in Poland and on selected legal interpretation. The authors present some materials on the situation in foreign countries and mention the international organization International Internet Preservation Consortium (IIPC). Adapted from the source document. DA - 2012/// PY - 2012 IS - 1 LA - Polish SN - 1507-7187, 1507-7187 UR - https://search.proquest.com/docview/1266143224?accountid=27464 L4 - http://www.ebib.info/biuletyn/ KW - Web archiving KW - Web sites KW - Legal deposit KW - Poland KW - Librarians KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Internet archiving, web archiving, the Internation KW - Law ER - TY - JOUR TI - The Future of the Past of the Web AU - Brack, Matthew T2 - Ariadne AB - We have all heard at least some of the extraordinary statistics that attempt to capture the sheer size and ephemeral nature of the Web. According to the Digital Preservation Coalition (DPC), more than 70 new domains are registered and more than 500,000 documents are added to the Web every minute. 
This scale, coupled with its ever-evolving use, presents significant challenges to those concerned with preserving both the content and context of the Web. Co-organised by the DPC, the British Library and JISC, this workshop was the third in a series of discussions around the nature and potential of Web archiving. Following the keynote address, two thematic sessions looked at 'Using Web Archives' (as it is only recently that use cases have started to emerge) and 'Emerging Trends' (acknowledging that Web archiving activities are on the increase, along with a corresponding rise in public awareness). Adapted from the source document. DA - 2012/03// PY - 2012 DO - http://www.ariadne.ac.uk/issue68/fpw11-rpt IS - 68 LA - English SN - 1361-3200, 1361-3200 UR - https://search.proquest.com/docview/1680142275?accountid=27464 L4 - http://www.ariadne.ac.uk/ KW - Web archiving KW - article KW - 1.12: LIS - CONFERENCES KW - Workshops ER - TY - JOUR TI - Functionalities of Web Archives AU - Niu, Jinfang T2 - D-Lib Magazine AB - The functionalities that are important to the users of web archives range from basic searching and browsing to advanced personalized and customized services, data mining, and website reconstruction. The author examined ten of the most established English language web archives to determine which functionalities each of the archives supported, and how they compared. A functionality checklist was designed, based on use cases created by the International Internet Preservation Consortium (IIPC), and the findings of two related user studies. The functionality review was conducted, along with a comprehensive literature review of web archiving methods, in preparation for the development of a web archiving course for Library and Information School students. This paper describes the functionalities used in the checklist, the extent to which those functionalities are implemented by the various archives, and discusses the author's findings.
Adapted from the source document. DA - 2012/03// PY - 2012 DO - 10.1045/march2012-niu2 VL - 18 IS - 3-4 LA - English SN - 1082-9873, 1082-9873 UR - https://search.proquest.com/docview/1266143632?accountid=27464 L4 - http://www.dlib.org/dlib/march12/niu/03niu2.html KW - Web archiving KW - Web archive KW - article KW - Methods KW - 5.18: ELECTRONIC MEDIA KW - evaluation KW - Evaluation KW - functionality KW - overview KW - usability KW - Usability ER - TY - JOUR TI - Web Archives on Both Sides of the Atlantic Ocean -- Internet Archive, Wayback Machine and UK Web Archive TT - Archiwa internetowe po obu stronach Atlantyku Internet Archive, Wayback Machine oraz UK Web Archive AU - Gmerek, Katarzyna T2 - Biuletyn EBIB AB - The article is a comparison of two web archives -- from the US and UK -- which differ in terms of storage rules and ways of using resources. The Wayback Machine is a private initiative, based on a private foundation, using the cooperation of volunteers in many countries in the world. It gathers world websites without any selection or censorship. UK Web Archive is an initiative of British libraries, which wish to fulfill the idea of legal deposit of British websites (which is currently unworkable because of underdeveloped laws). The websites are carefully selected and their content is evaluated by the librarians according to the importance and usefulness of the site. Adapted from the source document. 
DA - 2012/// PY - 2012 IS - 1 LA - Polish SN - 1507-7187, 1507-7187 UR - https://search.proquest.com/docview/1266143228?accountid=27464 L4 - http://www.ebib.info/biuletyn/ KW - Web archiving KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Methods KW - Comparisons KW - Internet archiving, archiving websites, web archiv KW - UK KW - USA ER - TY - JOUR TI - Archiving of Comprehensive Annual Financial Reports (CAFRs) on State Government Web Sites AU - Thornton, Joel B T2 - Behavioral & Social Sciences Librarian AB - Rising costs and declining revenues have hampered the financial affairs of state governments, forcing many to curtail services, reduce employee benefits, and trim the workforce, calling into question the fiscal sustainability of many state governments. As a result, stakeholders are demanding greater accountability and increased transparency into state government finances. An important link or communication tool between state governments and stakeholders is the comprehensive annual financial report. The comprehensive annual financial report (CAFR), produced by state governments, provides some insight into how taxpayer dollars are spent and the benefits derived therefrom. This article analyzes the extent to which the states electronically archive the CAFR on their websites and the accessibility of the reports to users searching state government websites. Adapted from the source document.
DA - 2012/04// PY - 2012 DO - http://dx.doi.org/10.1080/01639269.2012.686244 VL - 31 IS - 2 SP - 87 EP - 95 LA - English SN - 0163-9269, 0163-9269 UR - https://search.proquest.com/docview/1550992606?accountid=27464 KW - Web archiving KW - archives KW - Government information KW - article KW - 5.2: MATERIALS BY SUBJECTS KW - Access to information KW - CAFR KW - comprehensive annual financial report KW - Finance KW - Reports KW - State government KW - state government publications KW - web-based government publications ER - TY - JOUR TI - Web Archives for Researchers: Representations, Expectations and Potential Uses AU - Stirling, Peter AU - Chevallier, Philippe AU - Illien, Gildas T2 - D-Lib Magazine AB - The Internet has been covered by legal deposit legislation in France since 2006, making web archiving one of the missions of the Bibliotheque nationale de France (BnF). Access to the web archives has been provided in the library on an experimental basis since 2008. In the context of increasing interest in many countries in web archiving and how it may best serve the needs of researchers, especially in the expanding field of Internet studies for social sciences, a qualitative study was performed, based on interviews with potential users of the web archives held at the BnF, and particularly researchers working in various areas related to the Internet. The study aimed to explore their needs in terms of both content and services, and also to analyse different ways of representing the archives, in order to identify ways of increasing their use. While the interest of maintaining the 'memory' of the web is obvious to the researchers, they are faced with the difficulty of defining, in what is a seemingly limitless space, meaningful collections of documents. 
Cultural heritage institutions such as national libraries are perceived as trusted third parties capable of creating rationally-constructed and well-documented collections, but such archives raise certain ethical and methodological questions. Adapted from the source document. DA - 2012/03// PY - 2012 DO - 10.1045/march2012-stirling VL - 18 IS - 3-4 LA - English SN - 1082-9873, 1082-9873 UR - https://search.proquest.com/docview/1266143619?accountid=27464 L4 - http://www.dlib.org/dlib/march12/stirling/03stirling.html KW - Web archiving KW - National libraries KW - Researchers KW - article KW - 5.18: ELECTRONIC MEDIA KW - Bibliotheque Nationale de France KW - User needs ER - TY - JOUR TI - Preserving born-digital catalogues raisonnés: Web archiving at the New York Art Resources Consortium (NYARC) AU - Duncan, Sumitra T2 - Art Libraries Journal DA - 2015/// PY - 2015 VL - 40 IS - 2 SP - 50 EP - 55 LA - English SN - 03074722 UR - https://search.proquest.com/docview/1693347798?accountid=27464 KW - Library And Information Sciences ER - TY - JOUR TI - Capture All the URLs: First Steps in Web Archiving AU - Antracoli, Alexis AU - Duckworth, Steven AU - Silva, Judith AU - Yarmey, Kristen T2 - Pennsylvania Libraries AB - As higher education embraces new technologies, university activities--including teaching, learning, and research--increasingly take place on university websites, on university-related social media pages, and elsewhere on the open Web. Despite perceptions that "once it's on the Web, it's there forever," this dynamic digital content is highly vulnerable to degradation and loss. In order to preserve and provide enduring access to this complex body of university records, archivists and librarians must rise to the challenge of Web archiving. As digital archivists at our respective institutions, the authors introduce the concept of Web archiving and articulate its importance in higher education. 
We provide our institutions' rationale for selecting subscription service Archive-It as a preservation tool, outline the progress of our institutional Web archiving initiatives, and share lessons learned, from unexpected stumbling blocks to strategies for raising funds and support from campus stakeholders. DA - 2014/// PY - 2014 DO - http://dx.doi.org/10.5195/palrap.2014.67 VL - 2 IS - 2 SP - 155 EP - 170 LA - English UR - https://search.proquest.com/docview/1634873262?accountid=27464 KW - Digital libraries KW - Library And Information Sciences KW - Academic libraries KW - Archives & records KW - URLs KW - Library science KW - Higher education ER - TY - JOUR TI - An Overview of Web Archiving AU - Niu, Jinfang T2 - D-Lib Magazine AB - This overview is a study of the methods used at a variety of universities, and international government libraries and archives, to select, acquire, describe and access web resources for their archives. Creating a web archive presents many challenges, and library and information schools should ensure that instruction in web archiving methods and skills is made part of their curricula, to help future practitioners meet those challenges. In preparation for developing a web archiving course, the author conducted a comprehensive literature review. The findings are reported in this paper, along with the author's views on some of the methods in use, such as how traditional archive management concepts and theories can be applied to the organization and description of archived web resources. Adapted from the source document. 
DA - 2012/03// PY - 2012 DO - 10.1045/march2012-niu1 VL - 18 IS - 3-4 LA - English SN - 1082-9873, 1082-9873 UR - https://search.proquest.com/docview/1266143627?accountid=27464 L4 - http://www.dlib.org/dlib/march12/niu/03niu1.html KW - Web archiving KW - Digital preservation KW - Web archive KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Methods KW - Government libraries KW - Universities KW - web archive methods KW - web resources ER - TY - JOUR TI - Web Archiving Methods and Approaches: A Comparative Study AU - Masanès, Julien T2 - Library Trends AB - The Web is a virtually infinite information space, and archiving its entirety, all its aspects, is a utopia. The volume of information presents a challenge, but it is neither the only nor the most limiting factor given the continuous drop in storage device costs. Significant challenges lie in the management and technical issues of the location and collection of Web sites. As a consequence of this, archiving the Web is a task that no single institution can carry out alone. This article will present various approaches undertaken today by different institutions; it will discuss their focuses, strengths, and limits, as well as a model for appraisal and identifying potential complementary aspects amongst them. A comparison for discovery accuracy is presented between the snapshot approach done by the Internet Archive (IA) and the event-based collection done by the Bibliothèque Nationale de France (BNF) in 2002 for the presidential and parliamentary elections. The balanced conclusion of this comparison allows for identification of future direction for improvement of the former approach. 
DA - 2005/// PY - 2005 VL - 54 IS - 1 SP - 72 EP - 90 LA - English SN - 00242594 UR - https://search.proquest.com/docview/220467286?accountid=27464 KW - Library And Information Sciences KW - Archives & records KW - Internet KW - Information management KW - Web sites KW - Elections ER - TY - JOUR TI - Archiving the Internet - Web pages of political parties AU - Peach, M T2 - Assignation AB - The Internet has great potential as a source of grey literature. Describes the efforts of the Centro de Estudios Avanzados en Ciencias Sociales (CEACS) of the Instituto Juan March in Madrid, Spain, to take advantage of that potential as a source for researchers present and future. Discusses the following: public use of the Internet in Spain; profile of the CEACS project; nature of political party pages; current status of the project; problems and technical needs; and project expansion. DA - 1998/07// PY - 1998 VL - 15 IS - 4 SP - 54 EP - 58 LA - English SN - 0265-2587, 0265-2587 UR - https://search.proquest.com/docview/57443754?accountid=27464 KW - Politics KW - Grey literature KW - Instituto Juan March, Spain Centro de Estudios Ava KW - Online information retrieval KW - Spain ER - TY - JOUR TI - Internet Archive joins history's great libraries AU - O'Leary, Mick T2 - Information Today AB - Brewster Kahle is a man of many roles: a famous Internet pioneer, a successful dot-com entrepreneur, a digital visionary, and a darned good librarian. Right now, he's best-known as the founder of Alexa and the WAIS system. However, with Kahle's creation of the Internet Archive (IA), the future may well ascribe greater importance to his work as a librarian. IA is the largest archival project in history. Kahle compares it - without presumption or exaggeration - to the ancient Library of Alexandria. It intends to do for the Internet what that great library did for antiquity: to capture and preserve the world's knowledge for everyone's benefit.
IA has been hard at work for several years creating the largest database in the world. At first, it concentrated on preservation. Now, with that task well in hand, it's working on access tools for this unique information resource. DA - 2003/11// PY - 2003 VL - 20 IS - 10 SP - 41 LA - English UR - https://search.proquest.com/docview/214817883?accountid=27464 KW - Digital libraries KW - Library And Information Sciences--Computer Applica KW - Online data bases KW - 9190:United States KW - 5240:Software & systems KW - 9120:Product specific KW - Software reviews KW - United States KW - US ER - TY - JOUR TI - Problems of Long-Term Preservation of Web Pages TT - Problematika Dolgorocne Hrambe Spletnih Strani AU - Decman, Mitja T2 - Knjiznica AB - The World Wide Web is a distributed collection of web sites available on the Internet anywhere in the world. Its content is constantly changing: old data are being replaced which causes constant loss of a huge amount of information and consequently the loss of scientific, cultural and other heritage. Often, unnoticeably even legal certainty is questioned. In what way the data on the web can be stored and how to preserve them for the long term is a great challenge. Even though some good practices have been developed, the question of final solution on the national level still remains. The paper presents the problems of long-term preservation of web pages from technical and organizational point of view. It includes phases such as capturing and preserving web pages, focusing on good solutions, world practices and strategies to find solutions in this area developed by different countries. The paper suggests some conceptual steps that have to be defined in Slovenia which would serve as a framework for all document creators in the web environment and therefore contributes to the consciousness in this field, mitigating problems of all dealing with these issues today and in the future. Adapted from the source document. 
DA - 2011/// PY - 2011 VL - 55 IS - 1 SP - 193 EP - 208 LA - Slovene SN - 0023-2424, 0023-2424 UR - https://search.proquest.com/docview/1266143497?accountid=27464 L4 - https://knjiznica.zbds-zveza.si/knjiznica/article/download/6010/5657 KW - Web archiving KW - Web archives KW - Digital preservation KW - Web pages KW - long-term preservation KW - harvesting KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Slovenia KW - web pages ER - TY - JOUR TI - Metadata Mix and Match AU - Coyle, Karen T2 - Information Standards Quarterly AB - The author was asked to consult with the Internet Archive's Open Library project primarily to lend her expertise in bibliographic data. To her dismay, the Open Library data did not look anything like library bibliographic data. She learned, however, that there were some good reasons for this. The first was that the Open Library was not limiting itself to library data. Another reason the Open Library does not limit itself to the more rigorous library data style was that the Open Library allows editing of its data by the general public: people with no particular bibliographic training. The most compelling reason to deviate from the standard view posited by library bibliographic data, however, has to do with the concept of linked data. It's an unfortunate fact that many systems combine data from different sources using only the "dumb down" method, reducing the metadata to the few matching elements and resulting in the least rich metadata record possible. 
DA - 2009/// PY - 2009 VL - 21 IS - 1 SP - 8 EP - 11 LA - English UR - https://search.proquest.com/docview/1735033500?accountid=27464 KW - Web archiving KW - Library And Information Sciences KW - Metadata KW - Data bases KW - Bibliographic records KW - Curriculum development KW - Information sources KW - Library cataloging KW - Resource Description Framework-RDF KW - Standards ER - TY - JOUR TI - Growing an Archives Department: (and other concerns of a new library manager) AU - Marciniak, Joe T2 - Computers in Libraries AB - The difference between a librarian and an archivist is that a librarian will drill, glue, and tape a resource to get it back in the stacks. An archivist will seal, hide, and lock up a resource to preserve it. In other words, the difference between a librarian and an archivist is everything. Librarians and archivists just have different professional philosophies. It comes down to access versus preservation. Although the archives department had existed in various ways for many decades, it was only given a permanent library home in 2009. The library's mission statement outlines the overall goal of the library: to provide quality resources, a high level of service, and innovative learning environments with leading-edge technology. Providing quality resources was where the author felt the archives department could fit into the overall mission of the library. The mission statement of your institution is an essential starting point for establishing common ground with a colleague when working on a project.
DA - 2015/04// PY - 2015 VL - 35 IS - 3 SP - 16 EP - 19 LA - English SN - 10417915 UR - https://search.proquest.com/docview/1680526979?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Digitization KW - Archives & records KW - Library collections KW - Metadata KW - Librarians KW - Archivists KW - Library managers KW - Meetings KW - Mission statements ER - TY - JOUR TI - The California Light and Sound Collection: Preserving Our Media Heritage AU - Hulser, Richard P T2 - Computers in Libraries AB - While most of the focus on digital preservation and access has been on digitizing printed materials, there is an initiative underway in California to capture and make accessible audiovisual content in such a way that even libraries, museums, and archives with limited resources can participate. The California Light and Sound collection is the outgrowth of the California Preservation Program's California Audiovisual Preservation Project (CAVPP). CAVPP plays the lead role in helping participating partner organizations conserve and preserve their audiovisual collections according to best practices for the archiving and preservation of moving image and sound formats. It also established a low-cost and practical workflow for helping partner organizations efficiently digitize key media artifacts. CAVPP coordinates all digitization activities with the vendor doing the digitization work and helps the participating institution throughout the process. To optimize quality control, CAVPP prefers working with labs that can handle all audiovisual formats. This not only saves shipping costs but ensures that the appropriate standards and procedures are applied to all recordings. 
DA - 2015/04// PY - 2015 VL - 35 IS - 3 SP - 4 EP - 10 LA - English SN - 10417915 UR - https://search.proquest.com/docview/1680527010?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Digitization KW - Archives & records KW - Internet KW - Library collections KW - Metadata KW - Nominations KW - Museums KW - Public access KW - California KW - Costs KW - Disk drives KW - Local history KW - Quality control ER - TY - JOUR TI - Living Movements, Living Archives: Selecting and Archiving Web Content During Times of Social Unrest AU - Rollason-Cass, Sylvie AU - Reed, Scott T2 - New Review of Information Networking AB - The ease of creating and sharing content on the web has had a profound impact on the scope, pace, and mobility of social movements, as well as on how the documents and evidence of these movements are collected and preserved. This article will focus on the process of creating a web based archive around the #blacklivesmatter movement while exploring the concept of the "living archive" through collaborative collection building around social movements. By examining this and other event-based web collections, best practices and strategies to improve the process of selection and capture of web content in Living Archives are presented. 
DA - 2015/// PY - 2015 DO - http://dx.doi.org/10.1080/13614576.2015.1114839 VL - 20 IS - 1-2 SP - 241 EP - 247 LA - English SN - 1361-4576 UR - https://search.proquest.com/docview/1877779886?accountid=27464 KW - Web archiving KW - web archiving KW - Computers--Internet KW - 3.2:ARCHIVES KW - cultural responsibility KW - living archives KW - Social activism KW - social movements ER - TY - JOUR TI - Personal Archiving: Preserving Our Digital Heritage AU - Brown, Karen E K T2 - Library Resources & Technical Services AB - Extending from this, Danielle Conklin, author of chapter 2, "Personal Archiving for Individuals and Families," does a terrific job of outlining risks, such as obsolescence of the formats and software, the need to migrate information forward, the importance of keeping your collections organized, and distributing copies to assist preservation efforts. Richard Banks writes in chapter 11, "Our Technology Heritage," about devices that could, if ever fully developed, bring our digital lives into our physical lives (he values the idea of displaying our digital images, for example, in our homes), but right now little boxes that sit around and salvage and store information don't seem exactly visionary. DA - 2015/04// PY - 2015 VL - 59 IS - 2 SP - 94 EP - 95 LA - English SN - 00242527 UR - https://search.proquest.com/docview/1684295944?accountid=27464 KW - Web archiving KW - Digital archives KW - Library And Information Sciences KW - Books KW - Library collections KW - Essays KW - Institutional repositories KW - Reading ER - TY - JOUR TI - Building the Foundation: Creating an Electronic-Records Program at the University of Miami AU - Capell, Laura T2 - Computers in Libraries AB - Developing and implementing effective strategies to manage electronic records (e-records) is one of the biggest challenges facing the archives field today, as they acquire growing quantities of contemporary records generated by an increasingly digital society.
However, jumping into e-records archiving can be a daunting task. As the authors continue to move through the pilot project and develop their policies and procedures for born-digital content, they're looking ahead at the next steps. First of all, they want to build more robust digital forensics workflows, including exploring methods for more extensive analysis of their digital content and developing workflows to handle a wider range of media and formats. Second, they want to use the results of their survey to start processing legacy media in their collections. Finally, they want to explore more options for providing access so that they can effectively make a wide range of born-digital content available for research. DA - 2015/11// PY - 2015 VL - 35 IS - 9 SP - 28 EP - 32 LA - English SN - 10417915 UR - https://search.proquest.com/docview/1755071188?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Academic libraries KW - Archives & records KW - Library collections KW - Social networks KW - Colleges & universities KW - Archivists KW - Digital video KW - Electronic records KW - Pilot projects KW - Special collections KW - Video recordings ER - TY - JOUR TI - Choosing a Sustainable Web Archiving Method: A Comparison of Capture Quality AU - Gray, Gabriella AU - Martin, Scott T2 - D-Lib Magazine AB - The UCLA Online Campaign Literature Archive has been collecting websites from Los Angeles and California elections since 1998. Over the years the number of websites created for these campaigns has soared while the staff manually capturing the websites has remained constant. By 2012 it became apparent that we would need to find a more sustainable model if we were to continue to archive campaign websites. Our ideal goal was to find an automated tool that could match the high quality captures produced by the Archive's existing labor-intensive manual capture process.
The tool we chose to investigate was the California Digital Library's Web Archiving Service (WAS). To test the quality of WAS captures we created a duplicate capture of the June 2012 California election using both WAS and our manual capture and editing processes. We then compared the results from the two captures to measure the relative quality of the two captures. This paper presents the results of our findings and contributes a unique empirical analysis of the quality of websites archived using two divergent web archiving methods and sets of tools. Adapted from the source document. DA - 2013/05// PY - 2013 DO - http://dx.doi.org/10.1045/may2013-gray VL - 19 IS - 5-6 LA - English SN - 1082-9873, 1082-9873 UR - https://search.proquest.com/docview/1735638237?accountid=27464 L4 - http://www.dlib.org/dlib/may13/gray/05gray.html KW - Web archiving KW - Web Archiving KW - Web sites KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Methods KW - CDL Web Archiving Service KW - Comparisons KW - Politics KW - Quality KW - UCLA Online Campaign Literature Archive ER - TY - JOUR TI - Not all mementos are created equal: measuring the impact of missing resources AU - Brunelle, Justin F AU - Kelly, Mat AU - Salaheldeen, Hany AU - Weigle, Michele C AU - Nelson, Michael L T2 - International Journal on Digital Libraries AB - Web archives do not always capture every resource on every page that they attempt to archive. This results in archived pages missing a portion of their embedded resources. These embedded resources have varying historic, utility, and importance values. The proportion of missing embedded resources does not provide an accurate measure of their impact on the Web page; some embedded resources are more important to the utility of a page than others.
We propose a method to measure the relative value of embedded resources and assign a damage rating to archived pages as a way to evaluate archival success. In this paper, we show that Web users' perceptions of damage are not accurately estimated by the proportion of missing embedded resources. In fact, the proportion of missing embedded resources is a less accurate estimate of resource damage than a random selection. We propose a damage rating algorithm that provides closer alignment to Web user perception, providing an overall improved agreement with users on memento damage by 17% and an improvement by 51% if the mementos have a damage rating delta ≥ 0.30. We use our algorithm to measure damage in the Internet Archive, showing that it is getting better at mitigating damage over time (going from a damage rating of 0.16 in 1998 to 0.13 in 2013). However, we show that a greater number of important embedded resources (2.05 per memento on average) are missing over time. Alternatively, the damage in WebCite is increasing over time (going from 0.375 in 2007 to 0.475 in 2014), while the missing embedded resources remain constant (13% of the resources are missing on average). Finally, we investigate the impact of JavaScript on the damage of the archives, showing that a crawler that can archive JavaScript-dependent representations will reduce memento damage by 13.5%.
DA - 2015/09// PY - 2015 DO - http://dx.doi.org/10.1007/s00799-015-0150-6 VL - 16 IS - 3-4 SP - 283 EP - 301 LA - English SN - 14325012 UR - https://search.proquest.com/docview/1703891222?accountid=27464 KW - Web archiving KW - Digital libraries KW - Digital preservation KW - Library And Information Sciences--Computer Applica KW - World Wide Web KW - Digital archives KW - Web architecture KW - Memento damage ER - TY - JOUR TI - Archiving the Web, A Service Construct TT - Archiver le Web, un service en construction AU - Lasfargues, France AU - Medjkoune, Leila T2 - Documentaliste - Sciences de l'Information AB - Archiving the Web is an old problem that is taking shape, accompanied by new businesses contours. This article gives a few reminders of technical, historical and legal issues of web archiving before discussing the tasks entrusted to a Web archivist. The article outlines the context of Web archiving. Several duties involved in the work of web archivists are outlined: enriching collections, managing a budget, controlling the quality of the collection, giving access to the archive and preserving web content. Adapted from the source document. DA - 2012/09// PY - 2012 VL - 49 IS - 3 SP - 8 EP - 9 LA - French SN - 0012-4508, 0012-4508 UR - https://search.proquest.com/docview/1283633770?accountid=27464 KW - Web archiving KW - article KW - 2.14: LIS - TYPES OF STAFF KW - Professional responsibilities KW - Role ER - TY - JOUR TI - Who Gets to Die of Dysentery?: Ideology, Geography, and The Oregon Trail AU - Slater, Katharine T2 - Children's Literature Association Quarterly AB - This article examines the co-constitutive relationship between ideology and geography in three editions of the educational computer game The Oregon Trail, arguing that the game reinforces a colonialist worldview through representations of place, space, and time. 
Despite seeming to accommodate players of any race or gender, The Oregon Trail imagines its protagonist, the "you" traveling the Trail, as white and male, a construct that reinforces the supremacist narrative of nineteenth-century settlement. Through a rapid in-game compression of time and space that urges progress, the game encourages child players to perform the spatialized worldview that codifies manifest destiny. DA - 2017/// PY - 2017 DO - http://dx.doi.org/10.1353/chq.2017.0040 VL - 42 IS - 4 SP - 374 EP - 395 LA - English SN - 08850429 UR - https://search.proquest.com/docview/2009304874?accountid=27464 KW - Web archiving KW - Education KW - Library And Information Sciences KW - Archives & records KW - 19th century KW - American history KW - Colonialism KW - Computer & video games KW - Consortia KW - Extremism KW - Geography KW - Information literacy KW - Minnesota KW - Native North Americans KW - New York KW - Oregon KW - Oregon Trail KW - Race KW - Racism KW - School districts KW - Trails KW - United States--US KW - White supremacists ER - TY - JOUR TI - End of Term 2016 Presidential Web Archive AU - Phillips, Mark E AU - Phillips, Kristy K T2 - Against the Grain AB - During every Presidential election in the US since 2008, a group of librarians, archivists, and technologists representing institutions across the nation can be found hard at work, preserving the federal web domain and documenting the changes that occur online during the transition. Anecdotally, evidence exists that the data available on the federal web changes after each election cycle, either as a new president takes office, or when an incumbent president changes messages during the transition into a new term of office. Until 2004, nothing had been done to document this change. The National Archives and Records Administration (NARA) conducted the first large-scale capture of the federal web at the end of George W. Bush’s first term in office in 2004.
This is noteworthy because, while institutions like the Library of Congress, the Government Publishing Office, and NARA itself have web archiving as part of their imperative, none of their mandates are so broad as to cover the capture and preservation of the entirety of the federal web. DA - 2018/// PY - 2018 VL - 29 IS - 6 SP - 27 LA - English SN - 1043-2094 UR - https://search.proquest.com/docview/2077076158?accountid=27464 KW - Web archiving KW - Digital archives KW - Library And Information Sciences KW - Presidential elections ER - TY - JOUR TI - Ereviews AU - Verma, Henrietta AU - Price, Gary T2 - Library Journal AB - According to IA founder Brewster Kahle, the BPL collection includes "hillbilly music, early brass bands, and accordion recordings from the turn of the last century, offering an authentic audio portrait of how America sounded a century ago. The Presidential Records Act has in the past been understood to mean that executive branch administrative communication must be archived, but the U.S. Justice Department is moving to dismiss the lawsuit, saying that the president has authority over what is saved in accordance with the act. [...] FCW, a publication for federal technology executives, quotes Jason R. Baron, formerly chief litigator for the National Archives and Records Administration: "If White House counsel reads [the statute] narrowly...resulting in White House staff not being required to copy or transfer presidential records to an official electronic account before individual communications self-destruct, is that decision reviewable?" For further information on this case, see ow.ly/RAnE30fT8Yc.
DA - 2017/11/15/ PY - 2017 VL - 142 IS - 19 SP - 100 LA - English SN - 03630277 UR - https://search.proquest.com/docview/1964143235?accountid=27464 KW - Web archiving KW - Communication KW - Digital archives KW - Library And Information Sciences KW - Digitization KW - Books KW - Internet KW - Library collections KW - Metadata KW - Copyright KW - Litigation KW - Online instruction ER - TY - JOUR TI - Partnerships on Campus: Roles and Impacts on developing a New Online Research Resource at Boston College AU - Kowal, Kimberly C AU - Meehan, Seth T2 - The Catholic Library World AB - A partnership at Boston College (BC) between the Libraries and the Institute for Advanced Jesuit Studies resulted in a blossoming of services and resources, made possible via a combination of discipline-focused scholarship and library digital expertise. With a shared mission, the last three years have produced a number of programs and projects that relied upon a relationship of reciprocation, support, and ultimately the strategic directions guiding this Jesuit university. DA - 2018/03// PY - 2018 VL - 88 IS - 3 SP - 177 EP - 184 LA - English SN - 0008820X UR - https://search.proquest.com/docview/2024455820?accountid=27464 KW - Collaboration KW - Web archiving KW - Digital libraries KW - Digital preservation KW - Library And Information Sciences KW - Academic libraries KW - Boston Massachusetts KW - Digitization KW - Partnerships KW - Religious missions KW - Religious orders ER - TY - JOUR TI - The Online Media Environment of the North Caucasus: Issues of Preservation and Accessibility in a Zone of Political and Ideological Conflict AU - Condill, Kit T2 - Preservation, Digital Technology & Culture AB - As one of the world's most ethnolinguistically-diverse and conflict-prone regions, the North Caucasus presents particular challenges for librarians seeking to preserve its rich and varied online news media content.
This content is generated in multiple languages in multiple political and ideological contexts, both within the North Caucasus region and abroad. While online news media content in general is ephemeral, poorly-preserved, and difficult to access via any single search interface or search strategy, content relating to the North Caucasus is at additional risk due to ongoing insurgency/counterinsurgency activity, as well as historical, political and linguistic factors. Various options for preserving and searching North Caucasus web content are explored. DA - 2017/// PY - 2017 DO - http://dx.doi.org/10.1515/pdtc-2016-0022 VL - 45 IS - 4 SP - 166 EP - 176 LA - English SN - 21952957 UR - https://search.proquest.com/docview/1868026247?accountid=27464 KW - web archiving KW - Library And Information Sciences KW - Chechnya KW - conflict zones KW - North Caucasus KW - online media ER - TY - JOUR TI - Search Engine update AU - Notess, Greg R T2 - Online Searcher AB - Searchers can change the region in the settings, by adding ?gl=TLD (replace TLD with the country top level domain) to a Google search results URL, or use a VPN to instead mimic being in the other country. The prefix commands of info: and id: followed by a URL no longer display links to the cache copy, related pages, incoming links (not surprising since the link search capability in Google was abandoned previously), site search, and term matches. The partnership aims to increase the number of trained fact checkers, expand fact-checking capabilities to more countries, provide access to various fact-checking tools for free, and develop new fact-checking software tools to improve efficiency.
DA - 2018/// PY - 2018 VL - 42 IS - 1 SP - 8 EP - 9 LA - English SN - 23249684 UR - https://search.proquest.com/docview/1989831908?accountid=27464 KW - Web archiving KW - Archives & records KW - Books KW - Computers--Internet KW - Digital broadcasting KW - Internet KW - Library collections KW - Search engines ER - TY - JOUR TI - Evolution of legal deposit in New Zealand AU - Cadavid, Jhonny Antonio Pabón T2 - IFLA Journal AB - The evolution of legal deposit shows changes and challenges in collecting, access to and use of documentary heritage. Legal deposit emerged in New Zealand at the beginning of the 20th century with the aim of preserving print publications mainly for the use of a privileged part of society. In the 21st century legal deposit has evolved to include the safeguarding of electronic resources and providing access to the documentary heritage for all New Zealanders. The National Library of New Zealand has acquired new functions for a proper stewardship of digital heritage. E-deposit and web harvesting are two new mechanisms for collecting New Zealand publications. The article proposes that legal deposit through human rights and multiculturalism should involve different communities of heritage in web curation. DA - 2017/12// PY - 2017 DO - http://dx.doi.org/10.1177/0340035217713763 VL - 43 IS - 4 SP - 379 EP - 390 LA - English SN - 0340-0352 UR - https://search.proquest.com/docview/1979964191?accountid=27464 KW - web archiving KW - Library And Information Sciences KW - Cultural pluralism KW - Digital heritage KW - Human rights KW - Legal deposit KW - Multiculturalism & pluralism KW - national library KW - New Zealand KW - Publications KW - Twenty first century ER - TY - JOUR TI - Developing and raising awareness of the zine collections at the British Library AU - Cox, Debbie T2 - Art Libraries Journal AB - This article presents a practice-based account of collection development related to zines in the British Library. 
Rather than making the case for the collecting of zines, it aims to describe the process of collection building in a specific time and place, so that researchers have a better understanding of why certain resources are offered to them and others are not, and to share experiences with other librarians with zine collections. Zines form an element of the cultural memory of activists and cultural creators, and for researchers studying them it would seem useful to make transparent the motivations, methods and limitations of collection building. Librarians in the USA have written about their collecting practices for some time, for instance at Barnard College and New York Public Library; there has been less written about the practices of UK libraries. The article aims to make a contribution as a case study alongside accounts of collection development in a range of other libraries with zine collections, and it is written primarily from my own perspective as a curator in Contemporary British Collections since 2015, focusing on current practice, with some reference to earlier collecting. DA - 2018/04// PY - 2018 DO - http://dx.doi.org/10.1017/alj.2018.5 VL - 43 IS - 2 SP - 77 EP - 81 LA - English SN - 03074722 UR - https://search.proquest.com/docview/2018595315?accountid=27464 KW - Web archiving KW - Collection development KW - Library And Information Sciences KW - Library collections KW - Cultural heritage KW - Depository libraries KW - Donations KW - National libraries KW - Research KW - Researchers KW - United Kingdom--UK ER - TY - JOUR TI - The digitization of early English books: A database comparison of Internet Archive and Early English Books Online AU - Brightenburg, Cindy T2 - Journal of Electronic Resources Librarianship AB - The use of digital books is diverse, ranging from casual reading to in-depth primary source research.
Digitization of early English printed books, in particular, has provided greater access to a previously limited resource for academic faculty and researchers. Internet Archive, a free internet website, and Early English Books Online, a subscription-based database, are two such resources. This study compares the scope, coverage, and visual quality of the two book databases to determine the usability of each for faculty and researchers. DA - 2016/// PY - 2016 DO - http://dx.doi.org/10.1080/1941126X.2016.1130448 VL - 28 IS - 1 SP - 1 EP - 8 LA - English SN - 1941126X UR - https://search.proquest.com/docview/1780165316?accountid=27464 KW - Web archiving KW - Internet Archive KW - Library And Information Sciences KW - Digitization KW - Archives & records KW - Internet resources KW - E-books KW - Data bases KW - database searching KW - Digital books KW - Early English Books Online KW - English literature KW - Library science ER - TY - JOUR TI - Bringing Your Physical Books to Digital Learners via the Open Library Project AU - Kubilius, Ramune T2 - Against the Grain AB - Kahle, the founder and digital librarian of Internet Archive, is a visionary, to be sure, and his plenary presentation in Charleston was sincere and enthusiastic. It was quite impressive to hear how many patrons visit Internet Archive each day (3-4 million), and that it has 170 staff and 500 library and university partners. It is not hard to believe that the average life of a web page is (only) 100 days before it is deleted or changed.
DA - 2018/04// PY - 2018 VL - 30 IS - 2 SP - 63 LA - English SN - 1043-2094 UR - https://search.proquest.com/docview/2077576451?accountid=27464 KW - Web archiving KW - Library And Information Sciences KW - Archives & records ER - TY - JOUR TI - Avoiding spoilers: wiki time travel with Sheldon Cooper AU - Jones, Shawn M AU - Nelson, Michael L AU - Van de Sompel, Herbert T2 - International Journal on Digital Libraries AB - A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if fans are behind in their viewing they run the risk of encountering “spoilers”—information that gives away key plot points before the intended time of the show’s writers. Because the wiki history is indexed by revisions, finding specific dates can be tedious, especially for pages with hundreds or thousands of edits. A wiki’s history interface does not permit browsing across historic pages without visiting current ones, thus revealing spoilers in the current page. Enterprising fans can resort to web archives and navigate there across wiki pages that were live prior to a specific episode date. In this paper, we explore the use of Memento with the Internet Archive as a means of avoiding spoilers in fan wikis. We conduct two experiments: one to determine the probability of encountering a spoiler when using Memento with the Internet Archive for a given wiki page, and a second to determine which date prior to an episode to choose when trying to avoid spoilers for that specific episode. Our results indicate that the Internet Archive is not safe for avoiding spoilers, and therefore we highlight the inherent capability of fan wikis to address the spoiler problem internally using existing, off-the-shelf technology. We use the spoiler use case to define and analyze different ways of discovering the best past version of a resource to avoid spoilers. 
We propose Memento as a structural solution to the problem, distinguishing it from prior content-based solutions to the spoiler problem. This research promotes the idea that content management systems can benefit from exposing their version information in the standardized Memento way used by other archives. We support the idea that there are use cases for which specific prior versions of web resources are invaluable. DA - 2018/03// PY - 2018 DO - http://dx.doi.org/10.1007/s00799-016-0200-8 VL - 19 IS - 1 SP - 77 EP - 93 LA - English SN - 14325012 UR - https://search.proquest.com/docview/2002183210?accountid=27464 KW - Web archiving KW - Archives KW - Digital preservation KW - Library And Information Sciences--Computer Applica KW - Digital archives KW - Internet KW - Browsing KW - Content management systems KW - HTTP KW - Information management KW - Information resources KW - Internet resources KW - Management systems KW - Resource versioning KW - Spoilers KW - Time travel KW - Web sites KW - Wikis ER - TY - JOUR TI - Now You See It, Now You Don't. Unless ... AU - Kennedy, Shirley Duglin T2 - Information Today AB - According to Jill Lepore, the average life of a webpage is 100 days. As she notes, the embarrassing stuff seems to stick around a lot longer, but it's an indisputable fact that web-based content often goes missing: corporate reports, scholarly articles, government documents, working papers, maps, and creative works of all sorts. The Internet Archive and its Wayback Machine are pretty much universally loved by information professionals. You already know this, but aside from the Wayback Machine's valuable research function, the Internet Archive itself is a major time suck. Entertainment value aside, in late October, the Internet Archive announced on its blog that "with generous support from the Laura and John Arnold Foundation," it was planning to build "the Next Generation Wayback Machine". 
DA - 2015/12// PY - 2015 VL - 32 IS - 10 SP - 8 LA - English UR - https://search.proquest.com/docview/1761628166?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Archives & records KW - Internet KW - Information professionals KW - 5.18:ELECTRONIC MEDIA KW - High density storage ER - TY - JOUR TI - Design and implementation of crawling algorithm to collect deep web information for web archiving AU - Oh, Hyo-Jung AU - Dong-Hyun, Won AU - Kim, Chonghyuck AU - Park, Sung-Hee AU - Kim, Yong T2 - Data Technologies and Applications AB - Purpose: The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web. Design/methodology/approach: This study proposes and develops an algorithm to collect web information as if the web crawler gathers static webpages by managing script commands as links. The proposed web crawler actually experiments with the algorithm by collecting deep webpages. Findings: Among the findings of this study is that if the actual crawling process provides search results as script pages, the outcome only collects the first page. However, the proposed algorithm can collect deep webpages in this case. Research limitations/implications: To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors. Practical implications: The research results show deep webs are estimated to have 450 to 550 times more information than surface webpages, and it is difficult to collect web documents. However, this algorithm helps to enable deep web collection through script runs. Originality/value: This study presents a new method to be utilized with script links instead of adopting previous keywords.
The proposed algorithm is available as an ordinary URL. From the conducted experiment, analysis of scripts on individual websites is needed to employ them as links. DA - 2018/// PY - 2018 DO - http://dx.doi.org/10.1108/DTA-07-2017-0053 VL - 52 IS - 2 SP - 266 EP - 277 LA - English SN - 25149288 UR - https://search.proquest.com/docview/2083825786?accountid=27464 KW - Web archiving KW - Archiving KW - Library And Information Sciences--Computer Applica KW - Digital archives KW - Web sites KW - Algorithms KW - Case depth KW - Electronic documents KW - Links KW - Visual programming languages KW - Webs KW - Websites ER - TY - JOUR TI - API-based social media collecting as a form of web archiving AU - Littman, Justin AU - Chudnov, Daniel AU - Kerchner, Daniel AU - Peterson, Christie AU - Tan, Yecheng AU - Trent, Rachel AU - Vij, Rajat AU - Wrubel, Laura T2 - International Journal on Digital Libraries AB - Social media is increasingly a topic of study across a range of disciplines. Despite this popularity, current practices and open source tools for social media collecting do not adequately support today’s scholars or support building robust collections for future researchers. We are continuing to develop and improve Social Feed Manager (SFM), an open source application assisting scholars collecting data from Twitter’s API for their research. Based on our experience with SFM to date and the viewpoints of archivists and researchers, we are reconsidering assumptions about API-based social media collecting and identifying requirements to guide the application’s further development. We suggest that aligning social media collecting with web archiving practices and tools addresses many of the most pressing needs of current and future scholars conducting quality social media research. 
In this paper, we consider the basis for these new requirements, describe in depth an alignment between social media collecting and web archiving, outline a technical approach for effecting this alignment, and show how the technical approach has been implemented in SFM. DA - 2018/03// PY - 2018 DO - http://dx.doi.org/10.1007/s00799-016-0201-7 VL - 19 IS - 1 SP - 21 EP - 38 LA - English SN - 14325012 UR - https://search.proquest.com/docview/2002183484?accountid=27464 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=128034081&S=R&D=a9h&EbscoContent=dGJyMMvl7ESep7U4v%2BvlOLCmsEieprNSsaa4S6%2BWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - Web archiving KW - Web archives KW - Archives KW - Archiving KW - Library And Information Sciences--Computer Applica KW - Researchers KW - Alignment KW - Data collection - Twitter KW - Digital media KW - Freeware KW - Media KW - Social media KW - Social networks KW - Open source software KW - Acquisition of data KW - Application program interfaces ER - TY - JOUR TI - Uporabna vrednost podatkov spletnih zajemov: arhiviranje spletnih mest in analiza spletnih vsebin TT - The practical value of web capture data: archiving Web sites and Web content analysis AU - Kragelj, Matjaž AU - Kovačič, Mitja T2 - Knjiznica AB - Zakon o obveznem izvodu publikacij (2006) Narodni in univerzitetni knjižnici (NUK) nalaga skrb za zajem, ohranjanje in nudenje dostopa uporabnikom do zajetih spletnih publikacij, spletnih mest in vsebin. Leta 2015 je NUK opravil prvi zajem slovenske domene .si, naslove spletnih domen je priskrbel Arnes (Akademska in raziskovalna mreža Slovenije). V prispevku se osredotočamo na pomen zajema spletnih vsebin zaradi vsakodnevnega propadanja spletnih domen. Poleg zajema in dejavnosti za zagotavljanje ohranjanja zajetih vsebin je v prispevku tematizirano tudi pridobivanje informacij iz nestrukturiranih vsebin (spletnih dokumentov).
Omenjeni so primeri in delovanje aplikacij za zajemanje specifičnih informacij iz različnih spletnih dokumentov, npr. zajem cene določenega artikla v določeni trgovini z namenom obveščanja končnega uporabnika o najugodnejši ponudbi na trgu. Večji del prispevka je namenjen analizi zajetih spletnih vsebin in možnosti luščenja ter uteževanja besedišča, pridobljenega iz spletnih dokumentov. Z algoritmi in statistikami za označevanje in razvrščanje terminov v množici spletnih vsebin se spletni arhiv iz pasivne podatkovne zbirke spremeni v okolje, ki omogoča dodano vrednost povezovanja podatkov, iskanja sorodnosti znotraj podatkov spletnega arhiva in s podatki zunaj njega. English abstract: The Legal Deposit Act (2006) assigns to the National and University Library (NUK) the duty and the right to capture, preserve, and provide access to online publications, web sites, and other web content for library users. In 2015, the Library carried out the first capture of the Slovenian .si internet domain; the domain addresses were provided by ARNES (the Academic and Research Network of Slovenia). The article focuses on the importance of capturing web content, given the daily decay of web domains. In addition to capture and the activities that ensure the preservation of captured content, the paper also covers how to obtain information from unstructured content (documents on the web). It presents examples of applications that capture specific information from a variety of online documents (scraping), such as the price of a selected item in a particular web store, in order to inform the end user about the best offer on the market. The major part of the article is devoted to the analysis of captured web content and the possibilities for extracting and weighting the vocabulary derived from web documents.
Algorithms and statistics for tagging and ranking terms in a mass of web content can help transform the web archive from a passive database into an environment that adds value through data integration and the discovery of similarities within the web archive's data and with data outside it. DA - 2017/// PY - 2017 VL - 61 IS - 1/2 SP - 235 EP - 250 LA - Slovenian SN - 00232424 UR - https://search.proquest.com/docview/1966852571?accountid=27464 KW - Web archiving KW - Digital preservation KW - Digital archives KW - Library And Information Sciences KW - Academic libraries KW - Web analytics ER - TY - JOUR TI - Problem archiwizacji internetu w kontekście egzemplarza obowiązkowego : sytuacja w Polsce i wybranych krajach europejskich TT - The problem of archiving the Internet in the context of a mandatory copy: the situation in Poland and selected European countries AU - Dąbrowska, Ewa T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB AB - The article presents the legal provisions in force in Poland and other countries, and the solutions adopted, concerning the archiving of internet content by national libraries, together with the related challenges and problems. Publications of the character of a work that are posted online should be delivered to the entitled libraries. This is not universally observed, because the principle has not been clearly articulated in Polish law. Given the growing importance of communication in the digital environment, especially scholarly communication, a serious gap is opening in the archives of the national literature created by national libraries. This problem should be resolved in new regulations on legal deposit copies for libraries.
DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951540566?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/download/523/682 KW - Web archiving KW - Library And Information Sciences KW - Libraries KW - 3.2:ARCHIVES KW - Europe KW - Poland KW - Regulation ER - TY - JOUR TI - Recovery of vanished URLs: Comparing the efficiency of Internet Archive and Google AU - Kumar, D Vinay AU - Kumar, B T Sampath T2 - Malaysian Journal of Library & Information Science AB - This article examines the vanishing nature of URLs and the recovery of vanished URLs through the Internet Archive and the Google search engine. For that purpose, the study investigates the URLs cited in the articles of two LIS journals published during 2009-2013. A total of 226 articles published in two open access LIS journals were selected. Of the 5197 citations cited in the 226 articles, 21.05 percent (1094) were URLs. The study found that 38.12 percent (417 out of 1094) of the URLs were missing, and the remaining 61.88 percent were active at the time of the URL check with the W3C link checker. The HTTP 404 error message ("page not found") was the overwhelming message encountered, representing 54.2 percent of all HTTP error messages. The Internet Archive and the Google search engine were used to recover the vanished URLs: the Internet Archive recovered 66.19 percent of the total vanished URLs, whereas Google managed to recover only 30.70 percent. Recovery through the Internet Archive and Google increased the active URL rate from 61.88 percent to 87.11 percent and 73.58 percent, respectively. The study found that the Internet Archive is a more efficient tool for recovering vanished URLs than the Google search engine.
DA - 2017/// PY - 2017 VL - 22 IS - 2 SP - 31 LA - English SN - 1394-6234 UR - https://search.proquest.com/docview/1925123736?accountid=27464 L4 - https://jice.um.edu.my/index.php/MJLIS/article/view/3736/1664 KW - Web archiving KW - Library And Information Sciences KW - Internet KW - Search engines KW - 13.14:INFORMATION STORAGE AND RETRIEVAL - SEARCHIN KW - URLs ER - TY - JOUR TI - Archiwizacja internetu jako usługa naukowa TT - Internet archiving as a scientific service AU - Kugler, Anna AU - Beinert, Tobias AU - Schoger, Astrid T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB AB - Collecting and archiving websites relevant to scholarship has so far been a much-neglected sphere of activity for German libraries. To prevent serious losses and to ensure that researchers have permanent access to websites, the Bavarian State Library (BSB) created a web archiving system more than two years ago. The main goal of the project, approved by the German Research Foundation (DFG), was the development and implementation of a cooperative service model. The service is intended to support other cultural heritage institutions in their archiving activities and to facilitate the building of a distributed German system for archiving scholarly websites. Through this project, the Bavarian library aims to improve both the quantity and the quality of archived content and to promote its use in the field of scholarship. DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951541162?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/524/676 KW - Web archiving KW - Library And Information Sciences KW - 3.2:ARCHIVES KW - 3.11:NATIONAL LIBRARIES AND STATE LIBRARIES KW - Germany KW - State libraries ER - TY - JOUR TI - Managing Your Digital Afterlife AU - West, Jessamyn T2 - Computers in Libraries AB - More and more, people's lives are lived online.
When the author's father died 6 years ago, they were pleased to find a Google Docs file with the usernames and passwords to every account he owned. He was an engineer, so this was not terribly surprising. Most of these were things such as bank accounts and cable subscriptions, but a few were email accounts and (small) social media profiles. This made a complicated time much simpler. What if they hadn't been able to access his information? Jan Zastrow has written a great article in this issue on digital estate planning, which touches on these same ideas. In this article are some specific tech tools you can use to help archive and prepare your legacy on social media sites and in content repositories. DA - 2017/06// PY - 2017 VL - 37 IS - 5 SP - 23 EP - 25 LA - English SN - 10417915 UR - https://search.proquest.com/docview/1918332139?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Digital archives KW - Electronic documents KW - Digital media KW - Social networks KW - 14.11:COMMUNICATIONS AND INFORMATION TECHNOLOGY - KW - Electronic mail KW - Passwords KW - Repositories KW - Subscriptions ER - TY - JOUR TI - Scaling Up Perma.cc: Ensuring the Integrity of the Digital Scholarly Record AU - Dulin, Kim AU - Ziegler, Adam T2 - D - Lib Magazine AB - IMLS awarded the Harvard Library Innovation Lab a National Digital Platform grant to further develop the Lab's Perma.cc web archiving service. The funds will be used to provide technical enhancements to support an expanded user base, aid in outreach efforts to implement Perma.cc in the nation's academic libraries, and develop a commercial model for the service that will sustain the free service for the academic community. Perma.cc is a web archiving tool that puts the ability to archive a source in the hands of the author who is citing it. 
Once saved, Perma.cc assigns the source a new URL, which can be added to the original URL cited in the author's work, so that if the original link rots or is changed, the Perma.cc URL will still lead to the original source. Perma.cc is being used widely in the legal community with great success; the IMLS grant will make the tool available to other areas of scholarship where link rot occurs and will provide a solution for those in the commercial arena who do not currently have one. DA - 2017/// PY - 2017 VL - 23 IS - 5/6 LA - English SN - 1082-9873 UR - https://search.proquest.com/docview/1925481174?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Academic libraries KW - 3.2:ARCHIVES KW - 3.13:ACADEMIC LIBRARIES (NOT SCHOOL LIBRARIES) KW - Grants ER - TY - JOUR TI - The Technologies of Web Archiving AU - Maeda, Naotoshi AU - Oyama, Satoshi T2 - Joho no Kagaku to Gijutsu AB - In the last two decades, web archiving initiatives have spread around the world and have made substantial progress in legislation, improvement of tools and standards, and fostering of human resources. In particular, international collaboration on tool development, initiated by the IIPC, has achieved significant results that constitute the core of web archiving technologies today. This paper shows how the tools were developed and to what extent they have been implemented in the archives, and briefly describes the mechanisms of the core technologies: Heritrix, WARC and Wayback. Furthermore, it gives an overview of full-text search tools such as NutchWAX and Solr, organization through generated metadata, and the Memento project, which provides integrated access to open archives.
DA - 2017/02// PY - 2017 VL - 67 IS - 2 SP - 73 LA - English SN - 0913-3801 UR - https://search.proquest.com/docview/2007445742?accountid=27464 KW - Web archiving KW - Library And Information Sciences ER - TY - JOUR TI - Preservation of Database on Website - Accessibility Survey for “Dnavi (Database Navigation Service)” and Sustainability of Online Databases AU - Kimezawa, Tsukasa AU - Murayama, Yasuhiro T2 - Joho no Kagaku to Gijutsu AB - For publicly accessible databases on the Internet, long-term accessibility is not necessarily secured in general, because of changes in contents, relocation of URLs, and even closure of the websites. In this paper, we report the results of an accessibility survey, as of April 2017, of the databases registered in the National Diet Library (NDL) Database Navigation Service (Dnavi; operated in 2002-2014), showing that web access was rejected for 22% of the total of 17,470 databases after three years. We discuss sustainable access to databases published on web sites, from the viewpoint of the use of NDL's WARP (Web Archiving Project) and based on the OAIS reference model, a standard for the long-term preservation of electronic information. DA - 2017/09// PY - 2017 VL - 67 IS - 9 SP - 459 LA - English SN - 0913-3801 UR - https://search.proquest.com/docview/2003801756?accountid=27464 KW - Web archiving KW - Library And Information Sciences KW - Web sites KW - Online data bases ER - TY - JOUR TI - Finding the unfound: Recovery of missing URLs through Internet Archive AU - Kumar, Vinay D AU - Sampath Kumar, B T T2 - Annals of Library and Information Studies AB - The study investigated the accessibility and permanency of citations containing URLs in the articles published in the DESIDOC Journal of Library and Information Technology during 2006-2015. A total of 2133 URL citations were identified, out of which 823 were found to be incorrect or missing. HTTP-404 was the most common error message associated with the missing URLs.
The study also tried to recover the incorrect or missing URL citations using the Internet Archive, and recovered a total of 484 (58.81%) of them. DA - 2017/09// PY - 2017 VL - 64 IS - 3 SP - 165 LA - English SN - 0972-5423 UR - https://search.proquest.com/docview/2073135310?accountid=27464 L4 - http://nopr.niscair.res.in/bitstream/123456789/42988/1/ALIS%2064%283%29%20165-171.pdf KW - Web archiving KW - Library And Information Sciences KW - Archives & records KW - URLs ER - TY - JOUR TI - The Business of Making E-books Free AU - Anonymous T2 - Publishers Weekly AB - The work of scanning and preparing digital editions is being done by the Internet Archive, an online repository containing 11 million books, founded by Kahle in 1996 with the goal of making as much of the world's written, visual, and audio content available for free as possible. Brand said she believes the language of the existing contracts used by the press allows it to digitize the books without seeking renewed permissions, "but out of courtesy to authors and their estates, we're reaching out for every single book." Brand said a small number of authors refused to give permission, but she added that asking them is part and parcel of the mission of the press to devise novel ways to protect the works it publishes and the authors who write them.
DA - 2017/09/22/ PY - 2017 VL - 264 IS - 39 SP - 4 LA - English SN - 00000019 UR - https://search.proquest.com/docview/1943436218?accountid=27464 KW - Web archiving KW - Library And Information Sciences KW - Digitization KW - Archives & records KW - Internet KW - Libraries KW - Copyright KW - E-books ER - TY - JOUR TI - Technology: LC's New Born-Digital Archives AU - Peet, Lisa T2 - Library Journal AB - The American Folklife Center at the Library of Congress (LC) announced June 15 the creation of two new born-digital collections: the Web Cultures Web Archive (WCWA), which will feature memes, GIFs, and image macros that surface in online pop culture, and the Webcomics Web Archive (WWA), which will collect comics created for an online audience. WCWA's goal, to document the creation and sharing of web culture, means that it features such online phenomena as Lolspeak and Leet, emoji, reaction GIFs, memes, and digital urban legends--along with sites such as Urban Dictionary, Giphy, Metafilter, Cute Overload, and the LOLCat Bible Translation Project. At some point, said LC Digital Library project manager Abbie Grotke, she would like to see more content included, such as the potential for full-text search or derivative data sets--ways to help users dig deeper into the archive.
DA - 2017/09/15/ PY - 2017 VL - 142 IS - 15 SP - 14 LA - English SN - 03630277 UR - https://search.proquest.com/docview/1937841993?accountid=27464 KW - Web archiving KW - Digital curation KW - Archives KW - Digital archives KW - Library And Information Sciences KW - Internet KW - Library collections KW - Web sites KW - Metadata KW - American culture KW - Folklore KW - Nominations KW - Washington DC ER - TY - JOUR TI - Discovery Happens Here: PW Talks with Wikipedia's Jake Orlowitz AU - Anonymous T2 - Publishers Weekly AB - [...] we're looking to provide a better experience for our users. [...] we're working with partners like the Internet Archive to make sure more than a million URLs are properly archived and functioning; with OCLC to make it possible to cite books automatically, via an ISBN; and with OAdoi and OAbot to make free versions of paywalled sources cited on Wikipedia accessible and easy to find. [...] our hope is that readers who engage with Wikipedia will go on to explore the full-text resources cited there, whether in books, repositories, publisher websites, or, of course, in their public or university libraries. [...] those edits must pass through machine learning bots running on increasingly sophisticated neural networks looking for common vandalism patterns, through hundreds of language-matching RegEx filters catching bad words, through thousands of human "recent change" patrollers, and through tens of thousands of people's personal article watch lists. There's been tremendous evolution and flux around everything from peer review, to article levels and alternative metrics, open access and business models, creative commons licensing, social media, you name it.
DA - 2017/09/15/ PY - 2017 VL - 264 IS - 38 SP - 28 LA - English SN - 00000019 UR - https://search.proquest.com/docview/1940703367?accountid=27464 KW - Web archiving KW - Library And Information Sciences KW - Archives & records KW - Internet KW - Library collections KW - Information literacy KW - E-books KW - Community KW - Essays KW - Librarians KW - Library associations ER - TY - JOUR TI - "Internet Archive": la conservación de lo efímero TT - "Internet Archive": the conservation of the ephemeral AU - Mayagoitia, Ana AU - González Aguilar, Juan Manuel T2 - Documentación de las Ciencias de la Información AB - The ephemeral tends to be discarded, finding little room in traditional museums or archives. The emergence of digital archives and the acceptance of a sector in academia have helped to slowly modify the perception of ephemeral content. This article aims to analyze the evolution of the Internet Archive, a digital repository specialized in the compilation and conservation of ephemeral media. To conclude, a reflection is made about the future of digital preservation and the possibility of creating similar digital archives in Spanish-speaking countries. DA - 2017/// PY - 2017 DO - http://dx.doi.org/10.5209/DCIN.57196 VL - 40 SP - 157 EP - 167 LA - Spanish SN - 0210-4210 UR - https://search.proquest.com/docview/2050416699?accountid=27464 KW - Web archiving KW - Archives KW - Digital preservation KW - Digital archives KW - Internet KW - Journalism KW - Digital archive KW - Ephemeral patrimony KW - Internet archive KW - Museums KW - Public domain ER - TY - JOUR TI - For Old Times' Sake: Technostalgia's Greatest Hits AU - Lamphere, Carly T2 - Online Searcher AB - Nostalgia is a powerful feeling/emotion. In my case, chasing childhood nostalgia caused me to lug around an almost obsolete format for years before reluctantly parting with it, but only for practical reasons.
Naturally, nostalgia's strong emotional pull makes it a driving force in consumption and marketing today. Nostalgia marketing is everywhere, from foods and advertising to technology. When it comes to technology, the coined word "technostalgia" describes a "fond reminiscence of, or longing for, outdated technology" (en. wiktionary.org/wiki/technostalgia). DA - 2017/// PY - 2017 VL - 41 IS - 5 SP - 27 EP - 29 LA - English SN - 23249684 UR - https://search.proquest.com/docview/1942462381?accountid=27464 KW - Web archiving KW - Digital preservation KW - Computers--Internet KW - Computer & video games KW - 14:COMMUNICATIONS AND INFORMATION TECHNOLOGY KW - Consumers KW - Marketing KW - Nostalgia KW - Photographs KW - Technological obsolescence KW - Trends ER - TY - JOUR TI - Disappearing News Archives AU - Davis, Sarah Jane T2 - Online Searcher AB - Part of the preservation problem lies in the fact that newspapers are not official public records. According to the ProQuest title list, ProQuest News has the full text of the Milwaukee Journal Sentinel from April 1, 1995, to Dec. 31, 2009, a fraction of the full 123 years (1884-2007) formerly in Google News Archive. DA - 2016/// PY - 2016 VL - 40 IS - 6 SP - 46 LA - English SN - 23249684 UR - https://search.proquest.com/docview/1861822700?accountid=27464 KW - Web archiving KW - Public libraries KW - Digital archives KW - Digitization KW - Computers--Internet KW - Internet KW - Technological obsolescence KW - Erdogan KW - Information professionals KW - Newspapers KW - Recep Tayyip KW - Turkey ER - TY - JOUR TI - No Copies, No Comments AU - Padgett, Lauree T2 - Information Today AB - MIA-Missing in Archives "Disappearing News Archives," an Online Searcher feature by Sarah Jane Davis, contains, in part, text Davis cites from a March 16, 2016, ResearchBuzz blog post by Tara Calishain, as well as additional comments Calishain emailed to Online Searcher editor-in-chief Marydee Ojala. 
There is an irreplaceable connection that comes from holding and reading an ink-lined paper, with a few crossed-out words and some smudges, or fingering a faded snapshot, yellowing and curling up at the edges, that was lovingly pressed into a page by hand, not automatically done with perfect precision via Shutterfly. DA - 2016/12// PY - 2016 VL - 33 IS - 10 SP - 19 LA - English UR - https://search.proquest.com/docview/1861789618?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Archives & records KW - Internet KW - Social networks KW - Turkey ER - TY - JOUR TI - Mobilny pracownik – sprawozdanie z międzynarodowych warsztatów AU - Radzicka, Joanna T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951541346?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/525/674 KW - Library And Information Sciences ER - TY - JOUR TI - Toward comprehensive event collections AU - Nanni, Federico AU - Ponzetto, Simone Paolo AU - Dietz, Laura T2 - International Journal on Digital Libraries AB - Web archives, such as the Internet Archive, preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric sub-collections from such large archives, which includes not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). This is achieved by identifying relevant concepts and entities from a knowledge base, and then detecting their mentions in documents, which are interpreted as indicators for relevance. 
We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record; additionally, we test its performance on the TREC KBA Stream Corpus and on the TREC-CAR dataset, two publicly available large-scale web collections. DA - 2018/06/22/ PY - 2018 DO - 10.1007/s00799-018-0246-x SN - 1432-5012 UR - http://link.springer.com/10.1007/s00799-018-0246-x KW - Web archives KW - Collection building KW - Entity query expansion KW - Event collections KW - Named events ER - TY - JOUR TI - Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive AU - Ainsworth, Scott G. AU - Nelson, Michael L. T2 - International Journal on Digital Libraries AB - When viewing an archived page using the archive’s user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, potentially drifting away from the datetime originally selected. For sparsely archived resources, this almost transparent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive’s Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to <30 days on average regardless of walk length or number of domains visited. 
The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy. DA - 2015/06/05/ PY - 2015 DO - http://dx.doi.org/10.1007/s00799-014-0120-4 VL - 16 IS - 2 SP - 129 EP - 144 LA - English SN - 14325012 UR - https://search.proquest.com/docview/1681852984?accountid=27464 L4 - http://link.springer.com/10.1007/s00799-014-0120-4 KW - Digital libraries KW - Library And Information Sciences--Computer Applica KW - Digital archives KW - Temporal logic ER - TY - JOUR TI - Seminarium “Archiwizacja Internetu” w LaCH UW TT - Seminar "Archiving the Internet" at LaCH UW AU - Tokarska, Aleksandra AU - Wilkowski, Marcin T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB AB - A report on a seminar devoted to issues of web archiving, held at the Digital Humanities Laboratory of the University of Warsaw (LaCH UW) on 2 March 2017. DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951539759?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/531/678 KW - Web archiving KW - Library And Information Sciences KW - 3.2:ARCHIVES KW - Seminars KW - Warsaw Poland ER - TY - JOUR TI - Wayback Machine - podstawy wykorzystania TT - Wayback Machine - the basics of use AU - Wilkowski, Marcin T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB AB - The author analyzes the Wayback Machine, an online archive of World Wide Web resources designed as early as 1996 and made publicly available five years later. The aim of the article is to present the basic methods of using the Wayback Machine to work with the archival versions of websites preserved in this service. A beta version of the archive was released in October 2016, as the Internet Archive foundation marked 20 years of activity in the field of web archiving.
DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951539799?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/529/683 KW - Web archiving KW - Library And Information Sciences KW - 3.2:ARCHIVES ER - TY - JOUR TI - Witryna internetowa – dokumentacja czy publikacja? TT - Website - documentation or publication? AU - Konopa, Bartłomiej T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB AB - The article offers theoretical reflections on the nature of websites and attempts to classify them under the definition of either publication or documentation. The author identifies points that can inform the discussion of this issue, including binding regulations, the scholarly literature, the practice of other countries, and uniform subject-based file classification schemes. DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951539091?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/528/684 KW - Library And Information Sciences KW - Web sites KW - 13.14:INFORMATION STORAGE AND RETRIEVAL - SEARCHIN ER - TY - JOUR TI - Archiwizacja internetu – wnioski i rekomendacje z kilku raportów TT - Internet archiving - conclusions and recommendations from several reports AU - Derfert-Wolf, Lidia T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB AB - The article discusses three foreign reports on internet archiving. The 2013 report Web-Archiving presents the key problems of internet archiving from the perspective of institutions carrying out such projects, regardless of whether they outsource the work to external companies or perform it in-house. The report Preserving Social Media, prepared in 2016, concerns the preservation of social media resources. The Web Archiving Environmental Scan is an environmental analysis conducted in 2015 on behalf of Harvard University Library.
The study covered 23 institutions from around the world currently carrying out projects of this kind. The article also presents elements of the standards document ISO/TR 14873:2013 Information and Documentation – Statistics and quality issues for web archiving. It concludes with the forecasts for the development of internet archiving presented in the report Web Archives: The Future(s), published in 2011. DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951539109?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/532/685 KW - Web archiving KW - Digital archives KW - Library And Information Sciences KW - 3.2:ARCHIVES ER - TY - JOUR TI - Fotografia cyfrowa i technologia 360o – zastosowanie w projektach realizowanych przez Politechnikę Wrocławską TT - Digital photography and 360° technology - applied in projects realized by Wroclaw University of Technology AU - Pichlak, Monika Laura T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB AB - The article briefly recounts the history of the digital camera and the basic differences between digital and traditional photography. It describes the operation of a 360° photography studio and its use in projects at the Wroclaw University of Technology.
DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951540134?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/519/675 KW - History KW - Library And Information Sciences KW - 10.12:INFORMATION COMMUNICATION - HUMANITIES KW - Digital photography KW - Wroclaw Poland ER - TY - JOUR TI - Felieton "archiwalny" – ponownie po pięciu latach AU - Derfert-Wolf, Lidia AU - Wilkowski, Marcin T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951541363?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/533/680 KW - Library And Information Sciences ER - TY - JOUR TI - Kultura brytyjskiej sieci web TT - The culture of the British web AU - Cowls, Josh T2 - Elektroniczny Biuletyn Informacyjny Bibliotekarzy : EBIB AB - The author presents the British BUDDAH project, in which scholars used archived resources harvested from the web to conduct humanities research. The aim was to determine whether archiving websites for research purposes is worthwhile. The article describes the many different studies, methodological approaches, case studies, and technical tools that were created to carry out this research.
DA - 2017/// PY - 2017 IS - 172 SP - 1 LA - Polish UR - https://search.proquest.com/docview/1951541478?accountid=27464 L4 - http://open.ebib.pl/ojs/index.php/ebib/article/view/527/679 KW - Web archiving KW - Library And Information Sciences KW - 3.2:ARCHIVES ER - TY - JOUR TI - Archives of the Americas, (Mostly) Free Online AU - McDermott, Irene E T2 - Online Searcher AB - Established in 2008 to archive the transcribed texts of seminal documents in law, history, and diplomacy, the collection makes freely available important documents from ancient times, e.g., Agrarian Law, 111 BCE, right up to 2003, with "A Performance-Based Roadmap to a Permanent Two-State Solution to the Israeli-Palestinian Conflict." [...]visit the Digital Public Library of America (dp.la). According to Maura Marx, director of the DPLA Secretariat, "The DPLA's goal is to bring the entire nation's rich cultural collections off the shelves and into the innovative environment of the Internet for people to discover, download, remix, reuse and build on in ways we haven't yet begun to imagine" (cyber.law.harvard.edu/node/95550). DA - 2016/// PY - 2016 VL - 40 IS - 3 SP - 27 EP - 29 LA - English SN - 23249684 UR - https://search.proquest.com/docview/1818627659?accountid=27464 KW - Web archiving KW - Public libraries KW - Digital archives KW - Digitization KW - Computers--Internet KW - Internet KW - Library collections KW - American history KW - United States--US KW - Copyright KW - Museums KW - Photographs KW - Encyclopedias KW - Letters KW - Speeches KW - Treasuries ER - TY - JOUR TI - Ether Today, Gone Tomorrow: 21st Century Sound Recording Collection in Crisis AU - Tsou, Judy AU - Vallier, John T2 - Music Library Association. Notes AB - Today's music industry increasingly favors online-only, direct-to-consumer distribution. No longer can librarians expect to collect recordings on tangible media where first-sale doctrine applies.
Instead, at an ever-increasing rate, librarians are discovering that music recordings are available only via such online distribution sites as iTunes or Amazon.com. These distributors require individual purchasers to agree to restrictive end-user license agreements (EULAs) that explicitly forbid institutional ownership and such core library functions as lending. What does this mean for the future of music libraries? The coauthors present an overview of an Institute of Museum and Library Services (IMLS) funded project tasked with investigating the issue, and recommend a series of next steps designed to build our professional capacity toward addressing the challenge. DA - 2016/03// PY - 2016 VL - 72 IS - 3 SP - 461 EP - 483 LA - English SN - 00274380 UR - https://search.proquest.com/docview/1761140761?accountid=27464 KW - Web archiving KW - Academic libraries KW - Archives & records KW - Library collections KW - Cultural heritage KW - Librarians KW - Library associations KW - Apple iTunes KW - Blues music KW - Emergency preparedness KW - Motion pictures KW - Music libraries KW - Musical recordings KW - Online sales KW - Public access KW - Sound Recording And Reproduction KW - Streaming media ER - TY - JOUR TI - UK Official Publications: Managing the Transition to Electronic Deposit at the British Library AU - Grimshaw, Jennie T2 - Legal Information Management AB - This article by Jennie Grimshaw presents an overview of the transition of UK government publishing from print to electronic between the mid-1990s and 2016. It goes on to describe the tools being developed by the British Library in collaboration with the other five legal deposit libraries, to collect, preserve, organise and provide access to born digital government publications. 
This paradigm shift in official publishing gives the libraries a window of opportunity to improve their management of these materials and ensure that they can be found through their catalogues more easily than their print predecessors. DA - 2016/03// PY - 2016 DO - http://dx.doi.org/10.1017/S1472669616000037 VL - 16 IS - 1 SP - 3 EP - 9 LA - English SN - 14726696 UR - https://search.proquest.com/docview/1773570753?accountid=27464 KW - Web archiving KW - Legal Deposit KW - Digital archives KW - Library And Information Sciences KW - Depository libraries KW - United Kingdom--UK KW - Web sites KW - Metadata KW - Legal deposit KW - Data bases KW - British Library KW - Departments KW - Government archives KW - Government information KW - Government publications KW - official publications KW - Parliaments KW - Publishing ER - TY - THES TI - Scripts in a frame: A framework for archiving deferred representations AU - Brunelle, Justin F AB - Web archives provide a view of the Web as seen by Web crawlers. Because of rapid advancements and adoption of client-side technologies like JavaScript and Ajax, coupled with the inability of crawlers to execute these technologies effectively, Web resources become harder to archive as they become more interactive. At Web scale, we cannot capture client-side representations using the current state-of-the art toolsets because of the migration from Web pages to Web applications. Web applications increasingly rely on JavaScript and other client-side programming languages to load embedded resources and change client-side state. We demonstrate that Web crawlers and other automatic archival tools are unable to archive the resulting JavaScript-dependent representations (what we term deferred representations), resulting in missing or incorrect content in the archives and the general inability to replay the archived resource as it existed at the time of capture. 
Building on prior studies on Web archiving, client-side monitoring of events and embedded resources, and studies of the Web, we establish an understanding of the trends contributing to the increasing unarchivability of deferred representations. We show that JavaScript leads to lower-quality mementos (archived Web resources) due to the archival difficulties it introduces. We measure the historical impact of JavaScript on mementos, demonstrating that the increased adoption of JavaScript and Ajax correlates with the increase in missing embedded resources. To measure memento and archive quality, we propose and evaluate a metric to assess memento quality closer to Web users’ perception. We propose a two-tiered crawling approach that enables crawlers to capture embedded resources dependent upon JavaScript. Measuring the performance benefits between crawl approaches, we propose a classification method that mitigates the performance impacts of the two-tiered crawling approach, and we measure the frontier size improvements observed with the two-tiered approach. Using the two-tiered crawling approach, we measure the number of client-side states associated with each URI-R and propose a mechanism for storing the mementos of deferred representations. 
In short, this dissertation details a body of work that explores the following: why JavaScript and deferred representations are difficult to archive (establishing the term deferred representation to describe JavaScript-dependent representations); the extent to which JavaScript impacts archivability along with its impact on current archival tools; a metric for measuring the quality of mementos, which we use to describe the impact of JavaScript on archival quality; the performance trade-offs between traditional archival tools and technologies that better archive JavaScript; and a two-tiered crawling approach for discovering and archiving currently unarchivable descendants (representations generated by client-side user events) of deferred representations to mitigate the impact of JavaScript on our archives. In summary, what we archive is increasingly different from what we as interactive users experience. Using the approaches detailed in this dissertation, archives can create mementos closer to what users experience rather than archiving the crawlers’ experiences on the Web. CY - Ann Arbor DA - 2016/// PY - 2016 SP - 287 LA - English PB - Old Dominion University UR - https://search.proquest.com/docview/1803306325?accountid=27464 KW - Web archiving KW - Digital libraries KW - Digital preservation KW - Computer science KW - Library science KW - 0399:Library science KW - 0646:Web Studies KW - 0984:Computer science KW - Applied sciences KW - Communication and the arts KW - Javascript KW - Web crawling KW - Web science KW - Web Studies ER - TY - JOUR TI - The Rosarium Project AU - Tryon, Julia Rachel T2 - Digital Library Perspectives AB - Purpose: This paper aims to describe the Rosarium Project, a digital humanities project being undertaken at the Phillips Memorial Library + Commons of Providence College in Providence, Rhode Island. The project focuses on a collection of English language non-fiction writings about the genus Rosa.
The collection will comprise books, pamphlets, catalogs and articles from popular magazines, scholarly journals and newspapers written on the rose and published before 1923. The source material is being encoded using the Text Encoding Initiative (TEI) Consortium’s P5 guidelines and the extensible markup language (XML) editor software. Design/methodology/approach: This paper outlines the Rosarium Project and describes its workflow. This paper demonstrates how to create TEI-encoded files for digital curation using the XML editing software and the TEI Archiving Publishing and Access Service (TAPAS) Project. The paper provides information on the purpose, scope, audience and phases of the project. It also identifies the resources – hardware, software and membership – needed for undertaking such a project. Findings: This paper shows how straightforward it is to encode transcriptions of primary sources using the TEI and XML editing software and to make the resulting digital resources available on the Web. Originality/value: This paper presents a case study of how a research project transitioned from traditional printed bibliography to a web-accessible resource by capitalizing on the tools in the TEI toolkit using specialized XML editing software. The details of the project can be a guide for librarians and researchers contemplating digitally curating primary resources and making them available on the Web.
DA - 2016/// PY - 2016 DO - http://dx.doi.org/10.1108/DLP-01-2016-0001 VL - 32 IS - 3 SP - 209 EP - 222 LA - English SN - 20595816 UR - https://search.proquest.com/docview/1887028673?accountid=27464 KW - Collaboration KW - Web archiving KW - Archiving KW - Library And Information Sciences KW - Archives & records KW - Books KW - Internet KW - Library collections KW - Research KW - Researchers KW - Internet resources KW - Webs KW - Software KW - Consortia KW - Librarians KW - Data bases KW - Newspapers KW - 16th century KW - Bibliographic literature KW - Coding KW - Collection KW - Editing KW - English language KW - Extensible Markup Language KW - Horticulture KW - Journals KW - Markup KW - Popular culture KW - Workflow ER - TY - JOUR TI - A quantitative approach to evaluate Website Archivability using the CLEAR+ method AU - Banos, Vangelis AU - Manolopoulos, Yannis T2 - International Journal on Digital Libraries AB - Website Archivability (WA) is a notion established to capture the core aspects of a website, crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. In this work, aiming at measuring WA, we introduce and elaborate on all aspects of CLEAR+, an extended version of the Credible Live Evaluation Method for Archive Readiness (CLEAR) method. We use a systematic approach to evaluate WA from multiple different perspectives, which we call Website Archivability Facets. We then analyse archiveready.com, a web application we created as the reference implementation of CLEAR+, and discuss the implementation of the evaluation workflow. Finally, we conduct thorough evaluations of all aspects of WA to support the validity, the reliability and the benefits of our method using real-world web data. 
DA - 2016/06// PY - 2016 DO - http://dx.doi.org/10.1007/s00799-015-0144-4 VL - 17 IS - 2 SP - 119 EP - 141 LA - English SN - 14325012 UR - https://search.proquest.com/docview/1785958458?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - Digital archives KW - Web sites KW - Data mining KW - Web harvesting KW - Website Archivability ER - TY - JOUR TI - WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora AU - Callón, Miguel AU - Fdez-Glez, Jorge AU - Ruano-Ordás, David AU - Laza, Rosalía AU - Pavón, Reyes AU - Fdez-Riverola, Florentino AU - Méndez, Jose T2 - Sensors AB - In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool aimed at building scientific datasets to facilitate experimentation in web spam research. The developed application allows the user to specify multiple criteria that change the way in which new corpora are generated whilst reducing the number of repetitive and error-prone tasks related to existing corpus maintenance. For this goal, WARCProcessor supports up to six commonly used data sources for web spam research, and is able to store the output corpus in standard WARC format together with complementary metadata files. Additionally, the application facilitates the automatic and concurrent download of websites from the Internet, with the possibility of configuring the depth of links to be followed as well as the behaviour when redirected URLs appear. WARCProcessor supports both an interactive GUI and a command-line utility for background execution.
[ABSTRACT FROM AUTHOR] DA - 2017/12/22/ PY - 2017 DO - 10.3390/s18010016 VL - 18 IS - 1 SP - 16 SN - 1424-8220 UR - https://doi.org/10.3390/s18010016 L4 - http://www.mdpi.com/1424-8220/18/1/16 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archives KW - WEBSITES KW - corpus generation and maintenance KW - GRAPHICAL user interfaces KW - MICROPROCESSORS KW - multiple data sources KW - SPAM (Email) KW - WARC format 1.0 KW - web spam research ER - TY - RPRT TI - ArchiveSpark - MS Independent Study Final Submission AU - Galad, Andrej AB - This project expands upon the work at the Internet Archive of researcher Vinay Goel and of Jefferson Bailey (co-PI on two NSF-funded collaborative projects with Virginia Tech: IDEAL, GETAR) on the ArchiveSpark project - a framework for efficient Web archive access, extraction, and derivation. The main goal of the project is to quantitatively and qualitatively evaluate ArchiveSpark against mainstream Web archive processing solutions and extend it as necessary with regard to the processing of testing collections. This also relates to an IMLS-funded project. This report describes the efforts and contributions made as part of this project. The primary focus of these efforts lies in the comprehensive evaluation of ArchiveSpark against existing archive-processing solutions (pure Apache Spark with pre-installed Warcbase tools and HBase) in a variety of environments and setups in order to comparatively analyze the performance improvements that ArchiveSpark brings to the table as well as understand the shortcomings and tradeoffs of its usage under varying scenarios.
; IMLS LG-71-16-0037-16: Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse ; NSF IIS-1619028, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR) ; NSF IIS - 1319578: III: Small: Integrated Digital Event Archiving and Library (IDEAL) ; Included are the final report (PDF + Word), the final presentation (PPTX + PDF), the ArchiveSpark demo in the form of Jupyter Notebook, and the software developed during this project. CY - United States, North America DA - 2016/// PY - 2016 PB - Virginia Polytechnic Institute and State University UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://vtechworks.lib.vt.edu/handle/10919/77457 KW - Internet Archive KW - WARC KW - Web Archiving KW - Big data KW - ArchiveSpark KW - CDX KW - GETAR KW - HBase KW - IDEAL KW - ILMS KW - Spark ER - TY - JOUR TI - Efficient Topical Focused Crawling Through Neighborhood Feature AU - Suebchua, Tanaphol AU - Manaskasemsak, Bundit AU - Rungsawang, Arnon AU - Yamana, Hayato T2 - New Generation Computing AB - A focused web crawler is an essential tool for gathering domain-specific data used by national web corpora, vertical search engines, and so on, since it is more efficient than general Breadth-First or Depth-First crawlers. The problem in focused crawling research is the prioritization of unvisited web pages in the crawling frontier followed by crawling these web pages in the order of their priority. The most common feature, adopted in many focused crawling researches, to prioritize an unvisited web page is the relevancy of the set of its source web pages, i.e., its in-linked web pages. However, this feature is limited, because we cannot estimate the relevancy of the unvisited web page correctly if we have few source web pages. To solve this problem and enhance the efficiency of focused web crawlers, we propose a new feature, called the ‘‘neighborhood feature’’. 
This enables the adoption of additional already-downloaded web pages to estimate the priority of a target web page. The additionally adopted web pages consist both of web pages located in the same directory as the target web page and web pages whose directory paths are similar to that of the target web page. Our experimental results show that our enhanced focused crawlers outperform both crawlers not utilizing the neighborhood feature and state-of-the-art focused crawlers, including the HMM crawler. DA - 2018/04/15/ PY - 2018 DO - 10.1007/s00354-017-0029-8 VL - 36 IS - 2 SP - 95 EP - 118 SN - 0288-3635 UR - http://link.springer.com/10.1007/s00354-017-0029-8 KW - Web archive KW - Domain-specific dataset KW - Focused crawler KW - Vertical search engine ER - TY - RPRT TI - Universal distant reading through metadata proxies with archivespark AU - Holzmann, Helge AU - Goel, Vinay AU - Gustainis, Emily Novak AB - Digitization and the large-scale preservation of digitized content have engendered new ways of accessing and analyzing collections concurrent with other data mining and extraction efforts. Distant reading refers to the analysis of entire collections instead of close reading individual items like a single physical book or electronic document. The steps performed in distant reading are often common across various types of data collections like books, journals, or web archives, sources that are very valuable and have often been neglected as Big Data. We have extended our tool ArchiveSpark, originally designed to efficiently process Web archives, in order to support arbitrary data collections being served from either local or remote data sources by using metadata proxies. The ability to share and reuse researcher workflows across disciplines with very different datasets makes ArchiveSpark a universal distant reading framework.
In this paper, we describe ArchiveSpark's design extensions alongside an example of how it can be leveraged to analyze symptoms of polio mentioned in journals from the Medical Heritage Library. Our experiments demonstrate how users can reuse large portions of their job pipeline to accomplish a specific task across diverse data types and sources. Migrating an ArchiveSpark job to process a different dataset introduces an additional average code complexity of only 4.8%. Its expressiveness, scalability, extensibility, reusability, and efficiency have the potential to advance novel and rich methods of scholarly inquiry. DA - 2017/// PY - 2017 SP - 459 PB - IEEE UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web Archives KW - Computing and Processing KW - General Topics for Engineers KW - Signal Processing and Analysis KW - Internet KW - Aerospace KW - Big Data KW - Bioengineering KW - Data mining KW - Digital Libraries KW - Distant Reading KW - Geoscience KW - Indexes KW - Libraries KW - Metadata KW - Tools KW - Transportation ER - TY - JOUR TI - Government Surveillance and Declassified Documents. AU - Golderman, Gail AU - Connolly, Bruce T2 - Library Journal AB - Reviews are presented for several websites, including Digital National Security Archive at www.proquest.com/productsservices/databases/dnsa.html, ProQuest History Vault: Black Freedom Struggle in the 20th Century at www.proquest.com/productsservices/historyvault.html, and Secret Files from World. DA - 2018/01// PY - 2018 VL - 143 IS - 1 SP - 124 EP - 131 SN - 03630277 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - ARCHIVES -- Computer network resources KW - UNITED States.
National Security Agency KW - WEB archives ER - TY - JOUR TI - Introduction: The Web’s first 25 years AU - Brügger, Niels T2 - New Media & Society AB - In August 2016, we can celebrate the 25th anniversary of the World Wide Web. Or can we? There is no doubt that the World Wide Web – or simply: the Web – has played an important role in the communicative infrastructure of most societies since the mid-1990s, but when did the Web actually start? And how has the Web developed from its beginning until today? The six articles in this Special Issue/section revolve around one of these questions in various ways. DA - 2016/08/08/ PY - 2016 DO - 10.1177/1461444816643787 VL - 18 IS - 7 SP - 1059 EP - 1065 SN - 1461-4448 UR - http://journals.sagepub.com/doi/10.1177/1461444816643787 ER - TY - JOUR TI - A New Online Archive of Encoded Fado Transcriptions. AU - VIDEIRA, TIAGO GONZAGA AU - ROSA, JORGE MARTINS T2 - Empirical Musicology Review AB - A new online archive of encoded fado transcriptions is presented. This dataset is relevant as the first step towards a cultural heritage archive and as source material for the study of songs associated with fado practice using empirical, analytical and systematic methodologies (namely MIR techniques). It is also relevant as a source for artistic purposes, namely the creation of new songs. We detail the constitution of this symbolic music corpus and present how we conceived of and implemented a methodology for testing its internal consistency using a supervised classification system. 
DA - 2017/07// PY - 2017 VL - 12 IS - 3/4 SP - 229 EP - 243 SN - 15595749 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archives KW - FADOS KW - INFORMATION retrieval KW - methodology KW - music information retrieval KW - symbolic corpus ER - TY - CONF TI - Understanding the Position of Information Professionals with regards to Linked Data AU - McKenna, Lucy AU - Debruyne, Christophe AU - O'Sullivan, Declan AB - The aim of this study was to explore the benefits and challenges to using Linked Data (LD) in Libraries, Archives and Museums (LAMs) as perceived by Information Professionals (IPs). The study also aimed to gain an insight into potential solutions for overcoming these challenges. Data was collected via a questionnaire which was completed by 185 Information Professionals (IPs) from a range of LAM institutions. Results indicated that IPs find the process of integrating and interlinking LD datasets particularly challenging, and that current LD tooling does not meet their needs. The study showed that LD tools designed with the workflows and expertise of IPs in mind could help overcome these challenges. C1 - New York, New York, USA C3 - Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries - JCDL '18 DA - 2018/// PY - 2018 DO - 10.1145/3197026.3197041 SP - 7 EP - 16 PB - ACM Press SN - 978-1-4503-5178-2 UR - http://dl.acm.org/citation.cfm?doid=3197026.3197041 KW - archive KW - survey KW - Semantic Web KW - Linked Data KW - information KW - library KW - museum KW - professional KW - questionnaire ER - TY - RPRT TI - Open data as political web archives : citizen involvement or reputation’s elected in a « digital public sphere » ? AU - Le Béchec, Mariannig AU - Hare, Isabelle AB - International audience ; The access to digital data is an economic, social and political issue.
Accessibility concerns not only the online publication of these data in a database, but also the discourses produced by stakeholders on the web, such as the French association Regards citoyens. Since 2009, this group has aggregated data on the activity of French Deputies in the French National Assembly on the website nosdeputes.fr. In this case, politicians allow the circulation of data that are arranged by actors who, unlike journalists, are bound by no professional requirements. We are interested here in the enrichment of public data by citizens who participate in the public sphere in a form that differs from the mass media. We do not aim to comment on this public sphere but to describe it through the devices and mediations that connect institutions and citizens. We therefore discuss whether a website like nosdeputes.fr can become the holder of a "digital public sphere" and interrogate the form of citizen oversight it induces. The framing of data on nosdeputes.fr questions the relationship between citizens, media and elected officials. On the one hand, these devices change the relationship between citizens and political action. On the other hand, we can assume that these devices lead politicians to adapt some of their practices in the French National Assembly according to the electoral agenda. We do not focus on the influence of particular actors but on the citizen oversight induced by this device. For example, nosdeputes.fr has listed the activities of the 577 French Deputies since 2009. This survey provides detailed analysis of political activity in the National Assembly, but it is also interested in the citizen's gaze, through the comments they leave on MPs' actions.
CY - France, Europe DA - 2015/// PY - 2015 PB - HAL CCSD UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://halshs.archives-ouvertes.fr/halshs-01180691 KW - web archives KW - [ SHS.INFO ] Humanities and Social Sciences/Librar KW - digital public sphere KW - Iramuteq KW - open data ER - TY - JOUR TI - This is the future: A reconstruction of the UK business web space (1996–2001) AU - Musso, Marta AU - Merletti, Francesco T2 - New Media & Society AB - The Internet and the World Wide Web in particular have dramatically changed the way in which many companies operate. On the Web, even the smallest and most localised business has a potential global reach, and the development of online payment has redefined the selling market in most sectors. Boundaries and borders are being radically rediscussed. This article reconstructs the early approach of UK businesses to the World Wide Web between 1996 and 2001, a period in which the Web started to spread but it was not as engrained in everyday life as it would be in the following decade. While the fast and dispersed nature of the Web makes it almost impossible to accurately reconstruct the Web sphere in its historical dimension, this article proposes a methodology based on the usage of historical Web directories to access and map past Web spheres. DA - 2016/08/28/ PY - 2016 DO - 10.1177/1461444816643791 VL - 18 IS - 7 SP - 1120 EP - 1142 SN - 1461-4448 UR - http://journals.sagepub.com/doi/10.1177/1461444816643791 KW - Internet Archive KW - World Wide Web KW - Directories KW - management KW - marketing KW - online KW - online business services KW - sales KW - UK business KW - web scraper ER - TY - CONF TI - RDF-Gen AU - Santipantakis, Georgios M. AU - Kotis, Konstantinos I. AU - Vouros, George A. 
AU - Doulkeridis, Christos AB - Recent state-of-the-art approaches and technologies for generating RDF graphs from non-RDF data use languages designed for specifying transformations or mappings for data of various formats. This paper presents a new approach for the generation of ontology-annotated RDF graphs, linking data from multiple heterogeneous streaming and archival data sources, with high throughput and low latency. To support this, and in contrast to existing approaches, we propose embedding in the RDF generation process a close-to-sources data processing and linkage stage, supporting the fast template-driven generation of triples in a subsequent stage. This approach, called RDF-Gen, has been implemented as a SPARQL-based RDF generation approach. RDF-Gen is evaluated against the latest related work of RML and SPARQL-Generate, using real world datasets. C1 - New York, New York, USA C3 - Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics - WIMS '18 DA - 2018/// PY - 2018 DO - 10.1145/3227609.3227658 SP - 1 EP - 10 PB - ACM Press SN - 978-1-4503-5489-9 UR - http://dl.acm.org/citation.cfm?doid=3227609.3227658 KW - data-to-RDF mapping KW - RDF generation KW - RDF knowledge graph ER - TY - CONF TI - A Framework for Aggregating Private and Public Web Archives AU - Kelly, Mat AU - Nelson, Michael L. AU - Weigle, Michele C. AB - Personal and private Web archives are proliferating due to the increase in the tools to create them and the realization that Internet Archive and other public Web archives are unable to capture personalized (e.g., Facebook) and private (e.g., banking) Web pages. We introduce a framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potentially sensitive information contained in private captures.
We amend Memento syntax and semantics to allow TimeMap enrichment to account for additional attributes to be expressed inclusive of the requirements for dereferencing private Web archive captures. We provide a method to involve the user further in the negotiation of archival captures in dimensions beyond time. We introduce a model for archival querying precedence and short-circuiting, as needed when aggregating private and personal Web archive captures with those from public Web archives through Memento. Negotiation of this sort is novel to Web archiving and allows for the more seamless aggregation of various types of Web archives to convey a more accurate picture of the past Web. C1 - New York, New York, USA C3 - Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries - JCDL '18 DA - 2018/// PY - 2018 DO - 10.1145/3197026.3197045 SP - 273 EP - 282 PB - ACM Press SN - 978-1-4503-5178-2 UR - http://dl.acm.org/citation.cfm?doid=3197026.3197045 KW - web archiving KW - memento KW - privacy KW - personalization ER - TY - CONF TI - Robust Links in Scholarly Communication AU - Klein, Martin AU - Shankar, Harihar AU - de Sompel, Herbert AB - Web resources change over time and many ultimately disappear. While this has become an inconvenient reality in day-to-day use of the web, it is problematic when these resources are referenced in scholarship where it is expected that referenced materials can reliably be revisited. We introduce Robust Links, an approach aimed at maintaining the integrity of the scholarly record in a dynamic web environment. The approach consists of archiving web resources when referencing them and decorating links to convey information that supports accessing referenced resources both on the live web and in web archives. 
C1 - New York, NY, USA C3 - Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries DA - 2018/// PY - 2018 DO - 10.1145/3197026.3203885 SP - 357 EP - 358 PB - ACM SN - 978-1-4503-5178-2 UR - http://doi.acm.org/10.1145/3197026.3203885 KW - web archiving KW - content drift KW - link rot KW - persistence KW - persistent identifiers KW - scholarly communication ER - TY - CONF TI - Bootstrapping Web Archive Collections from Social Media AU - Nwala, Alexander C. AU - Weigle, Michele C. AU - Nelson, Michael L. AB - Human-generated collections of archived web pages are expensive to create, but provide a critical source of information for researchers studying historical events. Hand-selected collections of web pages about events shared by users on social media offer the opportunity for bootstrapping archived collections. We investigated if collections generated automatically and semi-automatically from social media sources such as Storify, Reddit, Twitter, and Wikipedia are similar to Archive-It human-generated collections. This is a challenging task because it requires comparing collections that may cater to different needs. It is also challenging to compare collections since there are many possible measures to use as a baseline for collection comparison: how does one narrow down this list to metrics that reflect if two collections are similar or dissimilar? We identified social media sources that may provide similar collections to Archive-It human-generated collections in two main steps. First, we explored the state of the art in collection comparison and defined a suite of seven measures (Collection Characterizing Suite - CCS) to describe the individual collections. Second, we calculated the distances between the CCS vectors of Archive-It collections and the CCS vectors of collections generated automatically and semi-automatically from social media sources, to identify social media collections most similar to Archive-It collections. 
The CCS distance comparison was done for three topics: "Ebola Virus," "Hurricane Harvey," and "2016 Pulse Nightclub Shooting." Our results showed that social media sources such as Reddit, Storify, Twitter, and Wikipedia produce collections that are similar to Archive-It collections. Consequently, curators may consider extracting URIs from these sources in order to begin or augment collections about various news topics. C1 - New York, New York, USA C3 - Proceedings of the 29th on Hypertext and Social Media - HT '18 DA - 2018/// PY - 2018 DO - 10.1145/3209542.3209560 SP - 64 EP - 72 PB - ACM Press SN - 978-1-4503-5427-1 UR - http://dl.acm.org/citation.cfm?doid=3209542.3209560 L1 - https://dl.acm.org/doi/pdf/10.1145/3209542.3209560 KW - web archiving KW - social media KW - Web Archiving KW - Collection evaluation KW - News KW - Social Media KW - news KW - collection evaluation ER - TY - CONF TI - Micro Archives as Rich Digital Object Representations AU - Holzmann, Helge AU - Runnwerth, Mila AB - Digital objects as well as real-world entities are commonly referred to in literature or on the Web by mentioning their name, linking to their website or citing unique identifiers, such as DOI and ORCID, which are backed by a set of meta information. All of these methods have severe disadvantages and are not always suitable though: They are not very precise, not guaranteed to be persistent or mean a big additional effort for the author, who needs to collect the metadata to describe the reference accurately. Especially for complex, evolving entities and objects like software, pre-defined metadata schemas are often not expressive enough to capture its temporal state comprehensively. We found in previous work that a lot of meaningful information about software, such as a description, rich metadata, its documentation and source code, is usually available online.
However, all of this needs to be preserved coherently in order to constitute a rich digital representation of the entity. We show that this is currently not the case, as only 10% of the studied blog posts and roughly 30% of the analyzed software websites are archived completely, i.e., all linked resources are captured as well. Therefore, we propose Micro Archives as rich digital object representations, which semantically and logically connect archived resources and ensure a coherent state. With Micrawler we present a modular solution to create, cite and analyze such Micro Archives. In this paper, we show the need for this approach as well as discuss opportunities and implications for various applications also beyond scholarly writing. C1 - New York, New York, USA C3 - Proceedings of the 10th ACM Conference on Web Science - WebSci '18 DA - 2018/// PY - 2018 DO - 10.1145/3201064.3201110 SP - 353 EP - 357 PB - ACM Press SN - 978-1-4503-5563-6 UR - http://dl.acm.org/citation.cfm?doid=3201064.3201110 KW - Crawling KW - Data Representation KW - Scientific Workflow KW - Web Archives ER - TY - CONF TI - Poster: Predicting Website Abuse Using Update Histories AU - Takata, Yuta AU - Akiyama, Mitsuaki AU - Yagi, Takeshi AU - Hato, Kunio AU - Goto, Shigeki AB - Threats of abusing websites that webmasters have stopped updating have increased. In this poster, we propose a method of predicting potentially abusable websites by retrospectively analyzing updates of software that composes websites. The method captures webmaster behaviors from archived snapshots of a website and analyzes the changes of web servers and web applications used in the past as update histories. A classifier that predicts website abuses is finally built by using update histories from snapshots of known malicious websites before the detections.
Evaluation results showed that the classifier could predict various website abuses, such as drive-by downloads, phishes, and defacements, with accuracy: a 76% true positive rate and a 26% false positive rate. C1 - New York, New York, USA C3 - Companion of the The Web Conference 2018 on The Web Conference 2018 - WWW '18 DA - 2018/// PY - 2018 DO - 10.1145/3184558.3186903 SP - 9 EP - 10 PB - ACM Press SN - 978-1-4503-5640-4 UR - http://dl.acm.org/citation.cfm?doid=3184558.3186903 KW - Internet Archive KW - CMS KW - Software Update KW - Website Abuse ER - TY - CONF TI - Acquiring Web Content From In-Memory Cache AU - Kumar, Abhinav AU - Xie, Zhiwu AB - Web content acquisition forms the foundation of value extraction of web data. Two main categories of acquisition methods are crawler-based methods and transactional web archiving or server-side acquisition methods. In this poster, we propose a new method to acquire web content from web caches. Our method provides improvement in terms of a reduced penalty on the HTTP transaction, flexibility to accommodate peak web server loads, and minimal involvement of the system administrator to set up the system. C1 - New York, New York, USA C3 - Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries - JCDL '18 DA - 2018/// PY - 2018 DO - 10.1145/3197026.3203868 SP - 359 EP - 360 PB - ACM Press SN - 978-1-4503-5178-2 UR - http://dl.acm.org/citation.cfm?doid=3197026.3203868 KW - Web archiving KW - in-memory cache KW - Memcached ER - TY - CONF TI - How it Happened AU - Alonso, Omar AU - Kandylas, Vasileios AU - Tremblay, Serge-Eric AB - Social networks like Twitter and Facebook are the largest sources of public opinion and real-time information on the Internet. If an event is of general interest, news articles follow and eventually a Wikipedia page.
We propose the problem of automatic event story generation and archiving by combining social and news data to construct a new type of document in the form of a Wiki-like page structure. We introduce a technique that shows the evolution of a story as perceived by the crowd in social media, along with editorially authored articles annotated with examples of social media as supporting evidence. At the core of our research is the temporally sensitive extraction of data that serves as context for retrieval purposes. Our approach includes a fine-grained vote counting strategy that is used for weighting purposes, pseudo-relevance feedback and query expansion with social data and web query logs along with a timeline algorithm as the base for a story. We demonstrate the effectiveness of our approach by processing a dataset comprising millions of English language tweets generated over a one-year period and present a full implementation of our system. C1 - New York, New York, USA C3 - Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries - JCDL '18 DA - 2018/// PY - 2018 DO - 10.1145/3197026.3197034 SP - 193 EP - 202 PB - ACM Press SN - 978-1-4503-5178-2 UR - http://dl.acm.org/citation.cfm?doid=3197026.3197034 ER - TY - CONF TI - Rewriting History AU - Lerner, Ada AU - Kohno, Tadayoshi AU - Roesner, Franziska AB - The Internet Archive's Wayback Machine is the largest modern web archive, preserving web content since 1996. We discover and analyze several vulnerabilities in how the Wayback Machine archives data, and then leverage these vulnerabilities to create what are to our knowledge the first attacks against a user's view of the archived web. Our vulnerabilities are enabled by the unique interaction between the Wayback Machine's archives, other websites, and a user's browser, and attackers do not need to compromise the archives in order to compromise users' views of a stored page.
We demonstrate the effectiveness of our attacks through proof-of-concept implementations. Then, we conduct a measurement study to quantify the prevalence of vulnerabilities in the archive. Finally, we explore defenses which might be deployed by archives, website publishers, and the users of archives, and present the prototype of a defense for clients of the Wayback Machine, ArchiveWatcher. C1 - New York, New York, USA C3 - Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security - CCS '17 DA - 2017/// PY - 2017 DO - 10.1145/3133956.3134042 SP - 1741 EP - 1755 PB - ACM Press SN - 978-1-4503-4946-8 UR - http://dl.acm.org/citation.cfm?doid=3133956.3134042 KW - web archives KW - web security ER - TY - CONF TI - Challenges and Opportunities within Personal Life Archives AU - Dang-Nguyen, Duc-Tien AU - Riegler, Michael AU - Zhou, Liting AU - Gurrin, Cathal AB - Nowadays, almost everyone holds some form or other of a personal life archive. Automatically maintaining such an archive is an activity that is becoming increasingly common; however, without automatic support, users will quickly be overwhelmed by the volume of data and will miss out on the potential benefits that lifelogs provide. In this paper we give an overview of the current status of lifelog research and propose a concept for exploring these archives. We motivate the need for new methodologies for indexing data, organizing content and supporting information access. Finally we describe challenges to be addressed and give an overview of initial steps that have to be taken to address the challenges of organising and searching personal life archives.
C1 - New York, New York, USA C3 - Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval - ICMR '18 DA - 2018/// PY - 2018 DO - 10.1145/3206025.3206040 SP - 335 EP - 343 PB - ACM Press SN - 978-1-4503-5046-4 UR - http://dl.acm.org/citation.cfm?doid=3206025.3206040 KW - Lifelogging KW - Personal Life Archive KW - Search Engine ER - TY - JOUR TI - Warcbase AU - Lin, Jimmy AU - Milligan, Ian AU - Wiebe, Jeremy AU - Zhou, Alice T2 - Journal on Computing and Cultural Heritage AB - Web archiving initiatives around the world capture ephemeral Web content to preserve our collective digital memory. However, unlocking the potential of Web archives for humanities scholars and social scientists requires a scalable analytics infrastructure to support exploration of captured content. We present Warcbase, an open-source Web archiving platform that aims to fill this need. Our platform takes advantage of modern open-source “big data” infrastructure, namely Hadoop, HBase, and Spark, that has been widely deployed in industry. Warcbase provides two main capabilities: support for temporal browsing and a domain-specific language that allows scholars to interrogate Web archives in several different ways. This work represents a collaboration between computer scientists and historians, where we have engaged in iterative codesign to build tools for scholars with no formal computer science training. To provide guidance, we propose a process model for scholarly interactions with Web archives that begins with a question and proceeds iteratively through four main steps: filter, analyze, aggregate, and visualize. We call this the FAAV cycle for short and illustrate with three prototypical case studies. This article presents the current state of the project and discusses future directions. 
DA - 2017/07/31/ PY - 2017 DO - 10.1145/3097570 VL - 10 IS - 4 SP - 1 EP - 30 SN - 15564673 UR - http://dl.acm.org/citation.cfm?doid=3129537.3097570 KW - WARC KW - Apache Hadoop KW - Apache HBase KW - Apache Spark KW - ARC KW - Big data ER - TY - CONF TI - A Baseline Search Engine for Personal Life Archives AU - Zhou, Liting AU - Dang-Nguyen, Duc-Tien AU - Gurrin, Cathal AB - In lifelogging, as the volume of personal life archive data is ever increasing, we have to consider how to take advantage of a tool to extract or exploit valuable information from these personal life archives. In this work we motivate the need for, and present, a baseline search engine for personal life archives, which aims to make the personal life archive searchable, organizable and easy to update. We also present some preliminary results, which illustrate the feasibility of the baseline search engine as a tool for getting insights from personal life archives. C1 - New York, New York, USA C3 - Proceedings of the 2nd Workshop on Lifelogging Tools and Applications - LTA '17 DA - 2017/// PY - 2017 DO - 10.1145/3133202.3133206 SP - 21 EP - 24 PB - ACM Press SN - 978-1-4503-5503-2 UR - http://dl.acm.org/citation.cfm?doid=3133202.3133206 KW - Lifelogging KW - Personal Life Archive KW - Search Engine ER - TY - JOUR TI - Ethical Challenges and Current Practices in Activist Social Media Archives AU - Velte, Ashlyn T2 - The American Archivist AB - Social media (Web applications supporting communication between Internet users) empower current activist groups to create records of their activities. Recent digital collections, such as the digital archives of the Occupy Wall Street movement and the Documenting Ferguson Project, demonstrate archival interest in preserving and providing access to activist social media. Literature describing current practices exists for related topics such as Web and social media archives, privacy and access for digital materials, and activist archives.
However, research on activist social media archives is scarce. These materials likely present subject- and format-specific challenges not yet identified in peer-reviewed research. Using a survey and semistructured interviews with archivists who collect activist social media, this study describes ethical challenges regarding acquisition and access. Specifically, the respondents were concerned about acquiring permission to collect and provide long-term access to activist groups' social media. When collecting social media as data sets, archivists currently intend to provide moderated access to the archives, whereas when dealing with social media accounts, archivists intend to seek permission to collect from the activist groups and provide access online. These current practices addressing ethical issues may serve as models for other institutions interested in collecting social media from activists. Understanding how to approach activist social media ethically decreases the risk that these important records of modern activism will be left out of the historical narrative. DA - 2018/03// PY - 2018 DO - 10.17723/0360-9081-81.1.112 VL - 81 IS - 1 SP - 112 EP - 134 SN - 0360-9081 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://americanarchivist.org/doi/10.17723/0360-9081-81.1.112 KW - Web archives KW - Archival theory and principles KW - Copyright and intellectual property KW - Digital preservation KW - Ethics KW - Privacy and confidentiality KW - Social media archives ER - TY - JOUR TI - The colors of the national Web: visual data analysis of the historical Yugoslav Web domain AU - Ben-David, Anat AU - Amram, Adam AU - Bekkerman, Ron T2 - International Journal on Digital Libraries AB - This study examines the use of visual data analytics as a method for historical investigation of national Webs, using Web archives.
It empirically analyzes all graphically designed (non-photographic) images extracted from Websites hosted in the historical .yu domain and archived by the Internet Archive between 1997 and 2000, to assess the utility and value of visual data analytics as a measure of nationality of a Web domain. First, we report that only 23.5% of Websites hosted in the .yu domain over the studied years had their graphically designed images properly archived. Second, we detect significant differences between the color palettes of .yu sub-domains (commercial, organizational, academic, and governmental), as well as between Montenegrin and Serbian Websites. Third, we show that the similarity of the domains’ colors to the colors of the Yugoslav national flag decreases over time. However, there are spikes in the use of Yugoslav national colors that correlate with major developments on the Kosovo frontier. DA - 2018/03/18/ PY - 2018 DO - 10.1007/s00799-016-0202-6 VL - 19 IS - 1 SP - 95 EP - 106 SN - 1432-5012 UR - http://link.springer.com/10.1007/s00799-016-0202-6 KW - Internet Archive KW - Web archives KW - analytics KW - Color analysis KW - National domain KW - Visual data KW - Yugoslavia ER - TY - JOUR TI - Erasing history. (Cover story) AU - Bustillos, Maria AU - Freshwater, Shannon T2 - Columbia Journalism Review AB - The article discusses digital journalism, focusing on the failure of an online news outlet, "The Honolulu Advertiser". The author discusses the history of digital archiving systems, the role played by the U.S. government in protecting digital archival documents, and technological innovations that protect internet archives.
DA - 2018/// PY - 2018 VL - 57 IS - 1 SP - 112 EP - 118 SN - 0010194X UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://www.cjr.org/special_report/microfilm-newspapers-media-digital.php/ KW - WEB archives KW - ARCHIVES -- United States KW - BUSINESS failures KW - HISTORY KW - HONOLULU Advertiser (Newspaper) KW - NEWS websites KW - ONLINE journalism KW - TECHNOLOGICAL innovations in journalism ER - TY - JOUR TI - A query language for multi-version data web archives AU - Meimaris, Marios AU - Papastefanatos, George AU - Viglas, Stratis AU - Stavrakas, Yannis AU - Pateritsas, Christos AU - Anagnostopoulos, Ioannis T2 - Expert Systems AB - The Data Web refers to the vast and rapidly increasing quantity of scientific, corporate, government and crowd-sourced data published in the form of Linked Open Data, which encourages the uniform representation of heterogeneous data items on the web and the creation of links between them. The growing availability of open linked datasets has brought forth significant new challenges regarding their proper preservation and the management of evolving information within them. In this paper, we focus on the evolution and preservation challenges related to publishing and preserving evolving linked data across time. We discuss the main problems regarding their proper modelling and querying and provide a conceptual model and a query language for modelling and retrieving evolving data along with changes affecting them. We present in detail the syntax of the query language and demonstrate its functionality over a real-world use case of an evolving linked dataset from the biological domain.
DA - 2016/08// PY - 2016 DO - 10.1111/exsy.12157 VL - 33 IS - 4 SP - 383 EP - 404 SN - 02664720 UR - http://10.0.4.87/exsy.12157 L4 - http://doi.wiley.com/10.1111/exsy.12157 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - ARCHIVES KW - WEB archives KW - archiving KW - CROWDSOURCING KW - data evolution KW - Data Web KW - HETEROGENEOUS computing KW - INFORMATION visualization KW - LINKED data (Semantic Web) KW - linked data preservation KW - QUERY languages (Computer science) ER - TY - JOUR TI - A historian's view on the right to be forgotten AU - De Baets, Antoon T2 - International Review of Law, Computers & Technology AB - This essay explores the consequences for historians of the ‘right to be forgotten', a new concept proposed by the European Commission in 2012. I first explain that the right to be forgotten is a radical variant of the right to privacy and clarify the consequences of the concept for the historical study of public and private figures. I then treat the hard cases of spent and amnestied convictions and of internet archives. I further discuss the applicability of the right to be forgotten to dead persons as part of the problem of posthumous privacy, and finally point to the ambiguity of the impact of the passage of time. While I propose some compromise solutions, I also conclude that a generalized right to be forgotten would lead to the rewriting of history in ways that impoverish our insights not only into anecdotal lives but also into the larger trends of history.
DA - 2016/01/02/ PY - 2016 DO - 10.1080/13600869.2015.1125155 VL - 30 IS - 1-2 SP - 57 EP - 66 SN - 1360-0869 UR - https://doi.org/10.1080/13600869.2015.1125155 L4 - https://www.tandfonline.com/doi/full/10.1080/13600869.2015.1125155 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archives KW - privacy KW - right to be forgotten KW - amnesty KW - DATA protection laws KW - EUROPEAN Commission KW - internet archives KW - passage of time KW - PERSONALLY identifiable information KW - posthumous privacy KW - private and public figures KW - RIGHT of privacy KW - RIGHT to be forgotten KW - right to forget KW - spent convictions ER - TY - JOUR TI - Web Archives for the Analog Archivist: Using Webpages Archived by the Internet Archive to Improve Processing and Description AU - Gelfand, Aleksandr T2 - Journal of Western Archives AB - Twenty years ago the Internet Archive was founded with the wide-ranging mission of providing universal access to all knowledge. In the two decades since, that organization has captured and made accessible over 150 billion websites. By incorporating the use of Internet Archive's Wayback Machine into their workflows, archivists working primarily with analog records may enhance their ability in such tasks as the construction of a processing plan, the creation of more accurate historical descriptions for finding aids, and potentially be able to provide better reference services to their patrons. This essay will look at some of the ways this may be accomplished. 
DA - 2018/// PY - 2018 VL - 9 IS - 1 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Internet Archive KW - Web archives KW - Archives--Processing KW - Archives--Reference services KW - Description KW - Processing KW - Wayback Machine ER - TY - JOUR TI - Methods and Approaches to Using Web Archives in Computational Communication Research AU - Weber, Matthew S. T2 - Communication Methods and Measures AB - This article examines the role of web archives as a critical source of data for conducting computational communication research. Web archives are large-scale databases containing comprehensive records of websites showing how those websites have evolved over time. Recent communication scholarship using web archives is reviewed, demonstrating the breadth of research conducted in this space. Subsequently, a methodological framework is proposed for using web archives in computational communication research. As a source of data, web archives present a number of methodological challenges, particularly with regard to the accuracy and completeness of web archives. These problems are addressed in order to better inform future work in this area. The closing sections outline a forward-looking trajectory for computational communication research using web archives. 
DA - 2018/04/03/ PY - 2018 DO - 10.1080/19312458.2018.1447657 VL - 12 IS - 2-3 SP - 200 EP - 215 SN - 1931-2458 UR - https://www.tandfonline.com/doi/full/10.1080/19312458.2018.1447657 ER - TY - JOUR TI - A registry of archived electronic journals AU - Sparks, Sue AU - Look, Hugh AU - Bide, Mark AU - Muir, Adrienne T2 - Journal of Librarianship and Information Science DA - 2010/06/07/ PY - 2010 DO - 10.1177/0961000610361552 VL - 42 IS - 2 SP - 111 EP - 121 SN - 0961-0006 UR - http://journals.sagepub.com/doi/10.1177/0961000610361552 ER - TY - JOUR TI - Website history and the website as an object of study AU - Brügger, Niels T2 - New Media & Society DA - 2009/// PY - 2009 DO - 10.1177/1461444808099574 VL - 11 IS - 1-2 SP - 115 EP - 132 SN - 1461-4448 UR - http://journals.sagepub.com/doi/10.1177/1461444808099574 ER - TY - JOUR TI - The Web-at-Risk at Three: Overview of an NDIIPP Web Archiving Initiative AU - Seneca, Tracy T2 - Library Trends AB - The Web-at-Risk project is a multi-year National Digital Information Infrastructure and Preservation Program (NDIIPP) funded effort to enable librarians and archivists to capture, curate, and preserve political and government information on the Web, and to make the resulting Web archives available to researchers. The Web-at-Risk project is a collaborative effort between the California Digital Library, New York University Libraries, the Stanford School of Computer Science, and the University of North Texas Libraries. Web-at-Risk is a multifaceted project that involves software development, integration of open-source solutions, and extensive needs assessment and collection planning work with the project’s curatorial partners. A major outcome of this project is the Web Archiving Service (WAS), a Web archiving curatorial tool developed at the California Digital Library. 
This paper will examine the Web-at-Risk project overall, how Web archiving fits into existing collection development practices, and the Web Archiving Service workflow, features, and technical approach. Issues addressed will include how the reliance on existing technologies both benefited and hindered the project, and how curator feedback shaped WAS design. Finally, the challenges faced and future directions for the project will be examined. DA - 2009/// PY - 2009 DO - 10.1353/lib.0.0045 VL - 57 IS - 3 SP - 427 EP - 441 SN - 1559-0682 UR - http://muse.jhu.edu/content/crossref/journals/library_trends/v057/57.3.seneca.html ER - TY - JOUR TI - Archiving and Analysing Techniques of the Ultra-Large-Scale Web-Based Corpus Project of NINJAL, Japan AU - Asahara, Masayuki AU - Maekawa, Kikuo AU - Imada, Mizuho AU - Kato, Sachi AU - Konishi, Hikari T2 - Alexandria: The Journal of National and International Library and Information Issues AB - In 2011, the National Institute for Japanese Language and Linguistics (NINJAL) launched a corpus compilation project to construct a web corpus for linguistic research comprising ten billion words by 2016. The project is divided into four categories: Page Collection, Linguistic Annotation, Release and Preservation. For Page Collection, web crawlers are employed to collect web text by crawling 100 million pages every three months and retaining several versions of the text for three-month periods. For Linguistic Annotation, the linguistic studies web corpus contains annotated linguistic information. To improve the usability of these linguistic resources, normalization tasks such as tag removal, word segmentation, dependency parsing, and register estimation are performed. For Release, word lists and n-gram data are published based on the crawled and annotated text corpus. In addition, applications are being developed to enable searching for morphosyntax patterns in the ten-billion-word corpus. 
For Preservation, crawled web pages are preserved in chronological order as web archives primarily to support the survey of ongoing linguistic changes. In this paper, we present the basic design of the four categories. Additionally, we report the current status of the corpus using basic statistics of the crawled data and discuss the importance of deduplicating sentences. DA - 2014/08// PY - 2014 DO - 10.7227/ALX.0024 VL - 25 IS - 1-2 SP - 129 EP - 148 SN - 0955-7490 UR - https://doi.org/10.7227/ALX.0024 L4 - http://journals.sagepub.com/doi/10.7227/ALX.0024 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Corpora (Linguistics) KW - crawling KW - web archive KW - Japan KW - Japanese language KW - Japanese language resources KW - Language digital resources KW - linguistic annotation KW - web corpus ER - TY - JOUR TI - Web Crawling AU - Olston, Christopher AU - Najork, Marc T2 - Foundations and Trends® in Information Retrieval DA - 2010/// PY - 2010 DO - 10.1561/1500000017 VL - 4 IS - 3 SP - 175 EP - 246 SN - 1554-0669 UR - http://www.nowpublishers.com/article/Details/INR-017 ER - TY - JOUR TI - Legal Issues Related to Whole-of-Domain Web Harvesting in Australia AU - Simes, Laura AU - Pymm, Bob T2 - Journal of Web Librarianship AB - Selective archiving of Web sites in Australia has been under way since 1996. This approach has seen carefully selected sites preserved after site owners granted permission. The labor-intensive nature of this process means only a small number of sites can ever be acquired in this manner. An alternate approach is an automated “whole-of-domain” capture of sites, which has been undertaken in a number of countries, including Australia. This article considers the existing legal position in taking this approach and looks at how legal deposit and copyright legislation constrains the process. 
It also considers recent amendments to the Copyright Act to provide more flexibility along the lines of the U.S. fair-use approach and the possible impact these new provisions may have for those involved with large-scale Web archiving in Australia. DA - 2009/06/23/ PY - 2009 DO - 10.1080/19322900902787227 VL - 3 IS - 2 SP - 129 EP - 142 SN - 1932-2909 UR - http://www.tandfonline.com/doi/abs/10.1080/19322900902787227 KW - legal deposit KW - digital preservation KW - Web harvesting KW - copyright KW - Internet archiving KW - fair use ER - TY - JOUR TI - Legal deposit and collection development in a digital world AU - Joint, Nicholas T2 - Library Review AB - Purpose – To compare and contrast national collection management principles for hard copy deposit collections and for digital deposit collections. Design/methodology/approach – A selective overview and summary of work to date on digital legal deposit and digital preservation. Findings – That the comprehensive nature of traditional print deposit collection often absolves national libraries from the more intractable problems of stock selection; whereas the difficulty of collecting the entire national digital web space means that intelligent selection is vital for the building of meaningful digital deposit collections. Research limitations/implications – These are indicative and partial insights based on small scale interrogation of trial digital deposit collections: the issue of collection development and selection biases in digital collection building needs greater in-depth research before hard and fast recommendations about collection management criteria can be arrived at. Practical implications – The principles outlined may offer practitioners in national libraries some useful insights into how to manage their digital deposit collections. Originality/value – This paper emphasises the social and political aspects of digital deposit issues, rather than the legal or technical aspects. 
DA - 2006/10// PY - 2006 DO - 10.1108/00242530610689310 VL - 55 IS - 8 SP - 468 EP - 473 SN - 0024-2535 UR - https://www.emeraldinsight.com/doi/10.1108/00242530610689310 KW - Digital libraries KW - National libraries KW - Collections management ER - TY - BOOK TI - Entity Extraction and Consolidation for Social Web Content Preservation AU - Dietze, Stefan AU - Maynard, Diana AU - Demidova, Elena AU - Risse, Thomas AU - Peters, Wim AU - Doka, Katerina AU - Stavrakas, Yannis AB - With the rapidly increasing pace at which Web content is evolving, particularly social media, preserving the Web and its evolution over time becomes an important challenge. Meaningful analysis of Web content lends itself to an entity-centric view to organise Web resources according to the information objects related to them. Therefore, the crucial challenge is to extract, detect and correlate entities from a vast number of heterogeneous Web resources where the nature and quality of the content may vary heavily. While a wealth of information extraction tools aid this process, we believe that the consolidation of automatically extracted data has to be treated as an equally important step in order to ensure high quality and non-ambiguity of generated data. In this paper we present an approach which is based on an iterative cycle exploiting Web data for (1) targeted archiving/crawling of Web objects, (2) entity extraction and detection, and (3) entity correlation. The long-term goal is to preserve Web content over time and allow its navigation and analysis based on well-formed structured RDF data about entities. 
CY - United States, North America DA - 2012/// PY - 2012 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L1 - http://ceur-ws.org/Vol-912/paper1.pdf L1 - http://stefandietze.files.wordpress.com/2009/01/dietze-et-al-entity-sda-2012-camera-ready.pdf KW - Web Archiving KW - Data Consolidation KW - Data Enrichment KW - Entity Recognition KW - Linked Data ER - TY - JOUR TI - Role and justification of web archiving by national libraries AU - Shiozaki, Ryo AU - Eisenschitz, Tamara T2 - Journal of Librarianship and Information Science AB - This paper reports on a questionnaire survey of 16 national libraries designed to clarify how national libraries attempt to justify their web archiving activities. Results indicate they envisage that a) the benefits brought about by their initiatives are greater than the overall costs, b) the costs imposed on libraries are greater than the costs imposed on stakeholders, and c) all of them are making efforts to respond to legal risks in various ways (e.g. legislation, contracting and opt-out policies) although there are trade-off relations in terms of costs for negotiation, scope of access and size and scope of the web archive. The paper discusses whether a basic logic for justification of their web archiving is valid from the perspective of balancing cost—benefit. Further, it highlights the potential, underlying premises of the logic that motivates the intervention of national libraries as public sector organizations. 
DA - 2009/06/20/ PY - 2009 DO - 10.1177/0961000609102831 VL - 41 IS - 2 SP - 90 EP - 107 SN - 0961-0006 UR - http://journals.sagepub.com/doi/10.1177/0961000609102831 KW - web archiving KW - national libraries KW - cost—benefits KW - legal risks KW - questionnaire survey ER - TY - JOUR TI - a2o: Access to Archives from the National Archives of Singapore AU - Beasley, Sarah AU - Kail, Candice T2 - Journal of Web Librarianship AB - The article offers information about a2o, which was created by the National Archives of Singapore in 2009. The name a2o is modelled on the chemical symbol of water, considered an essential element of life. It provides access to various databases, photographs, maps and plans, oral history audio files, and other audiovisual recordings in multiple ways. It also offers a variety of online exhibitions, including "Colours Behind Barbed Wires: A Prisoner of War's Story through Haxworth's Sketches" and "Colours in the Wind: Hill Street Police Station in Retrospect." DA - 2009/06/23/ PY - 2009 DO - 10.1080/19322900902896531 VL - 3 IS - 2 SP - 149 EP - 155 SN - 1932-2909 UR - http://www.tandfonline.com/doi/abs/10.1080/19322900902896531 ER - TY - JOUR TI - Copyright in the networked world: digital legal deposit AU - Seadle, Michael T2 - Library Hi Tech AB - Legal deposit is the requirement that particular types of material be deposited with a national library or designated research libraries. US law does not at present include any requirement for the deposit of works that exist solely in the form of Web pages. For digital materials, it makes no sense to write rules for legal deposit based on the medium. 
Nations and national libraries that ignore legal deposit for digital works will find themselves missing a significant and unrecoverable portion of their cultural heritage. DA - 2001/09// PY - 2001 DO - 10.1108/EUM0000000005893 VL - 19 IS - 3 SP - 299 EP - 303 SN - 0737-8831 UR - http://www.emeraldinsight.com/doi/10.1108/EUM0000000005893 KW - copyright KW - publishing ER - TY - JOUR TI - The Kulturarw Project — The Swedish Royal Web Archive AU - Arvidson, Allan AU - Lettenström, Frans T2 - The Electronic Library AB - KB (Kungliga biblioteket, The Royal Library), The National Library of Sweden, was founded in the 1500s. Since 1661, when the first Legal Deposit Law was introduced, it has functioned as the kingdom's national memory. Today KB receives everything printed that is distributed to the public in the form of books, journals, posters, maps, advertisements, catalogues and so on. Since 1994, following the latest version of the Legal Deposit Law, KB has also stored electronic publications ‘in fixed form’, i.e. published on CD‐ROM, tape or diskette. The total growth at KB is about 1.5 shelf‐kilometres per year. At that rate, KB's underground storage will be completely full by the year 2050. DA - 1998/02// PY - 1998 DO - 10.1108/eb045623 VL - 16 IS - 2 SP - 105 EP - 108 SN - 0264-0473 UR - http://www.emeraldinsight.com/doi/10.1108/eb045623 ER - TY - JOUR TI - Archiving in the networked world: betting on the future AU - Seadle, Michael T2 - Library Hi Tech A2 - Wusteman, Judith AB - Purpose – The goal of this column is not to argue the pros and cons of digital archiving, or to propose solutions to its problems, but to describe it as a research subject and a social phenomenon. Design/methodology/approach – This column relies on cultural anthropology, in particular the approach that Clifford Geertz championed, and for cultural anthropology, language and its social context matter. Findings – Archiving systems abound with competing claims about effectiveness. 
Transparency and evidence of public testing are rare, with a few exceptions. The lack of public testing does not mean that systems do less than they claim, but it does mean that libraries, archives and museums need to press for proof if they want to have confidence in the product. Originality/value – When betting on the future, there can be no certainty, but bets placed should be based on knowledge. DA - 2009/06/12/ PY - 2009 DO - 10.1108/07378830910968326 VL - 27 IS - 2 SP - 319 EP - 325 SN - 0737-8831 UR - https://www.emeraldinsight.com/doi/10.1108/07378830910968326 KW - Digital libraries KW - Museums KW - Library and information networks ER - TY - JOUR TI - Developments in Digital Preservation at the University of Illinois: The Hub and Spoke Architecture for Supporting Repository Interoperability and Emerging Preservation Standards AU - Habing, Thomas AU - Eke, Janet AU - Cordial, Matthew A. AU - Ingram, William AU - Manaster, Robert T2 - Library Trends AB - Funded by the National Digital Information Infrastructure and Preservation Program (NDIIPP), the ECHO DEPository Project supports the digital preservation efforts of the Library of Congress by contributing research and software to help society GET, SAVE, and KEEP its digital cultural heritage. Project activities include building Web archiving tools, evaluating existing repository software, developing architectures to enhance existing repositories' interoperability and preservation features, and modeling next-generation repositories for supporting long-term preservation. This article describes the development of the Hub and Spoke (HandS) Tool Suite, built to help curators of digital objects manage content in multiple repository systems while preserving valuable preservation metadata. 
Implementing METS and PREMIS, HandS provides a standards-based method for packaging content that allows digital objects to be moved between repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. Related project work investigating the more fundamental semantic issues underlying the preservation of the meaning of digital objects over time is profiled separately in this issue (Dubin et al., 2009). DA - 2009/// PY - 2009 DO - 10.1353/lib.0.0052 VL - 57 IS - 3 SP - 556 EP - 579 SN - 1559-0682 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://muse.jhu.edu/content/crossref/journals/library_trends/v057/57.3.habing.html KW - Web archiving KW - Information science KW - Web archives KW - Digitization of archival materials KW - Digital preservation KW - Digitization KW - Library science KW - Digitization of library materials KW - Archives -- Computer network resources KW - Preservation of materials KW - Library of Congress ER - TY - CHAP TI - Exploiting the Social and Semantic Web for Guided Web Archiving AU - Risse, Thomas AU - Dietze, Stefan AU - Peters, Wim AU - Doka, Katerina AU - Stavrakas, Yannis AU - Senellart, Pierre AB - The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into "community memories" that aim at building a better understanding of the public view on, e.g., celebrities, court decisions, and other events. In this paper we present the ARCOMEM architecture that uses semantic information such as entities, topics, and events complemented with information from the social Web to guide a novel Web crawler. 
The resulting archives are automatically enriched with semantic meta-information to ease the access and allow retrieval based on conditions that involve high-level concepts. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-33290-6_47. CY - Germany, Europe DA - 2012/// PY - 2012 SP - 426 EP - 432 PB - Heidelberg : Springer Verlag UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://link.springer.com/10.1007/978-3-642-33290-6_47 KW - Web archives KW - Web Archiving KW - Digital libraries KW - Artificial intelligence KW - Court decisions KW - ddc:004 KW - Meta information KW - Semantic information KW - Social Web KW - Text Analysis KW - Web content KW - Web Crawler ER - TY - JOUR TI - Digital Preservation through Archival Collaboration: The Data Preservation Alliance for the Social Sciences. AU - Altman, Micah AU - Adams, Margaret O AU - Crabtree, Jonathan AU - Donakowski, Darrell AU - Maynard, Marc AU - Pienta, Amy AU - Young, Copeland H T2 - American Archivist AB - The Data Preservation Alliance for the Social Sciences (Data-PASS) is a partnership of five major U.S. institutions with a strong focus on archiving social science research. The Library of Congress supports the partnership through its National Digital Information Infrastructure and Preservation Program (NDIIPP). The goal of Data-PASS is to acquire and preserve data from opinion polls, voting records, large-scale surveys, and other social science studies at risk of being lost to the research community. This paper discusses the agreements, processes, and infrastructure that provide a foundation for the collaboration. 
DA - 2009/// PY - 2009 VL - 72 IS - 1 SP - 170 EP - 184 SN - 03609081 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Digitization of archival materials KW - Digital preservation KW - Metadata KW - Electronic records KW - Information resources management KW - Archives -- Computer network resources KW - Preservation of materials KW - Archives collection management KW - Document imaging systems KW - Social science methodology KW - Social science research ER - TY - JOUR TI - Embracing Web 2.0: Archives and the Newest Generation of Web Applications. AU - Samouelian, Mary T2 - American Archivist AB - Archivists are converting physical collections to digital formats and displaying surrogates of these primary sources on their websites. Simultaneously, the Web is moving toward a shared environment that embraces collective intelligence and participation, which is often called Web 2.0. This paper investigates the extent to which Web 2.0 features have been integrated into archival digitization projects. Although the use of Web 2.0 features has not yet been widely discussed in the professional archival literature, this exploratory study of college and university repository websites in the United States suggests that archival professionals are embracing Web 2.0 to promote their digital content and redefine relationships with their patrons. 
DA - 2009/// PY - 2009 VL - 72 IS - 1 SP - 42 EP - 71 SN - 03609081 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Digitization of archival materials KW - Digital preservation KW - Digitization KW - Institutional repositories KW - Archivists KW - Archival materials KW - University & college archives KW - Collection management (Libraries) KW - Internet publishing KW - Scholarly electronic publishing KW - Scholarly websites KW - Technological innovations KW - Web 2.0 ER - TY - JOUR TI - Arcomem Crawling Architecture AU - Plachouras, Vassilis AU - Carpentier, Florent AU - Faheem, Muhammad AU - Masanès, Julien AU - Risse, Thomas AU - Senellart, Pierre AU - Siehndel, Patrick AU - Stavrakas, Yannis T2 - Future Internet AB - The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM’s crawling architecture. We introduce the overall architecture and we describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we have implemented to adapt Heritrix, an open source crawler, to the needs of the project. 
Our experimental results from real crawls show that ARCOMEM’s crawling architecture is effective in acquiring focused information about a topic and leveraging the information from social media. DA - 2014/08/19/ PY - 2014 DO - 10.3390/fi6030518 VL - 6 IS - 3 SP - 518 EP - 541 SN - 1999-5903 UR - http://www.mdpi.com/1999-5903/6/3/518 KW - web archiving KW - content acquisition KW - crawling architecture ER - TY - JOUR TI - Web archiving: ethical and legal issues affecting programmes in Australia and the Netherlands AU - Glanville, Lachlan T2 - The Australian Library Journal AB - Digital preservation is a major concern for libraries and organisations internationally. This paper will examine the barriers faced by web archiving programmes in national libraries, such as the Koninklijke Bibliotheek in the Netherlands and the National Library of Australia’s PANDORA. The report will analyse how these programmes deal with the difficulties and limitations inherent in such programmes by examining how they approach issues of selection, access and copyright, while drawing comparisons between the programmes of the two institutions and the legal frameworks in which they function. DA - 2010/08// PY - 2010 DO - 10.1080/00049670.2010.10735999 VL - 59 IS - 3 SP - 128 EP - 134 SN - 0004-9670 UR - http://www.tandfonline.com/doi/abs/10.1080/00049670.2010.10735999 ER - TY - JOUR TI - Technology Intersecting Culture: The British Slave Trade Legacies Project AU - Roberto, Rose T2 - Journal of the Society of Archivists AB - ‘British Slave Trade Legacies’ is a web archiving project that collected websites and online material related to and generated from the 2007 bicentenary of Parliament abolishing the British slave trade. The Internet Archive donated their Archive-It service to harvest websites for this collection, and now provides public access to digital objects within it. 
This paper describes two issues that the project raised: firstly, the validity of the 2007 anniversary as marked by cultural stakeholders; secondly, the challenges of documenting it, thereby adding to historical legacy material of this topic. The archivist’s role in the 21st century will also be discussed in the context of new digital age challenges. DA - 2008/10/22/ PY - 2008 DO - 10.1080/00379810902916274 VL - 29 IS - 2 SP - 207 EP - 232 SN - 0037-9816 UR - http://www.tandfonline.com/doi/full/10.1080/00379810902916274 ER - TY - JOUR TI - Separating the Wheat from the Chaff: Identifying Key Elements in the NLA .Au Domain Harvest AU - Fellows, Geoff AU - Harvey, Ross AU - Lloyd, Annemaree AU - Pymm, Bob AU - Wallis, Jake T2 - Australian Academic & Research Libraries AB - In 2005 and 2006 the National Library of Australia (NLA) carried out two whole-domain web harvests which complement the selective web archiving approach taken by PANDORA. Web harvests of this size pose significant challenges to their use. Despite these challenges, such harvests present fascinating research opportunities. The NLA has provided Charles Sturt University’s POA (Preservation for Ongoing Accessibility) research group with access to these web harvests and associated keyword indexes. This paper describes the 2006 harvest and uses the example of blogs to address how to identify material within the harvest and determine issues that need further investigation. 
DA - 2008/09// PY - 2008 DO - 10.1080/00048623.2008.10721346 VL - 39 IS - 3 SP - 137 EP - 148 SN - 0004-8623 UR - http://www.tandfonline.com/doi/abs/10.1080/00048623.2008.10721346 ER - TY - BOOK TI - The Web as History: Using Web Archives to Understand the Past and the Present A3 - Brügger, Niels A3 - Schroeder, Ralf AB - London: UCL Press, c2017 CY - United States, North America DA - 2017/// PY - 2017 ET - 1st PB - UCL Press SN - 978– 1– 911307– 56– 3 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://discovery.ucl.ac.uk/1542998/1/The-Web-as-History.pdf KW - Web archiving KW - History -- Methodology KW - Z701.3 .W43 ER - TY - GEN TI - Deriving Dynamics of Web Pages: A Survey AU - Senellart, Pierre AU - Oita, Marilena AB - The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed, which makes it challenging for Web crawlers to keep up-to-date with the current version of a Web page, all the more so since not all apparent changes are significant ones. We review major approaches to change detection in Web pages and extraction of temporal properties (especially, timestamps) of Web pages. We focus our attention on techniques and systems that have been proposed in the last ten years and we analyze them to get some insight into the practical solutions and best practices available. We aim at providing an analytical view of the range of methods that can be used, distinguishing them on several dimensions, especially, their static or dynamic nature, the modeling of Web pages, or, for dynamic methods relying on comparison of successive versions of a page, the similarity metrics used. We advocate for more comprehensive studies of the effectiveness of Web page change detection methods, and finally highlight open issues. 
DA - 2011/// PY - 2011 PB - HAL CCSD UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://ceur-ws.org/Vol-707/TWAW2011-paper4.pdf KW - Web archiving KW - [INFO.INFO-WB] Computer Science [cs]/Web KW - ACM : H.3.5.2 KW - Change monitoring KW - Timestamping ER - TY - BOOK TI - Web Archiving Effort in National Library of China: Paper - iPRES 2012 - Digital Curation Institute, iSchool, Toronto AU - Yunpeng, Qu AB - In this paper, we introduce the effort in National Library of China in recent years, including resources accumulation, software development and works in Promotion Project in China. We have developed a platform for Chinese web archiving. And we are building some sites to propagate our works to the nation. At last we figure out some questions about the web archiving in China. CY - Austria, Europe DA - 2012/// PY - 2012 PB - Digital Curation Institute, iSchool University of Toronto UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - Conferences KW - Canada KW - Conference 2012 KW - iPRES KW - iSchool KW - National Library of China KW - Toronto ER - TY - JOUR TI - The Missing Link: Observations on the Evolution of a Web Archive. AU - Fansler, Craig AU - Gilbertson, Kevin AU - Petersen, Rebecca T2 - Journal for the Society of North Carolina Archivists AB - The web is vast and unorganized, making it difficult to collect and to curate for archival and research purposes. In this article, we discuss web archiving in the scope of a university archive, the challenges associated with such web archiving, and archival strategies for building and maintaining a web archive. This article chronicles our experience developing appropriate standards of practice for this medium, providing adequate metadata for the digital objects, constructing precise capturing protocols, and sharing access to these online collections. 
While some difficulties lie in transforming our archival modalities from print to digital, an equal share of obstacles relate to the speed, scale, and distribution technologies of the web itself. [ABSTRACT FROM AUTHOR] DA - 2014/// PY - 2014 VL - 11 IS - 1 SP - 46 EP - 59 SN - 19458533 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Web archives KW - Metadata KW - Archival materials KW - Archival processing KW - Automatic data collection systems KW - University & college archives ER - TY - JOUR TI - Introducing Web Archives as a New Library Service: the Experience of the National Library of France AU - Aubry, Sara T2 - LIBER Quarterly AB - The collections held by the National Library of France (BnF) are part of the national heritage and include nearly 31 million documents of all types (books, journals, manuscripts, photographs, maps, etc.). New collection challenges have been posed by the emergence of the Internet. Within an international framework, the BnF is developing policy guidelines, workflows and tools to harvest relevant and representative segments of the French part of the Internet and organise their preservation and access. The Web archives of the French national domain were developed as a new service, released as a new application and made available to the public in April 2008. Since then, strategies have been and continue to be developed to involve librarians and reach out to end users. This article will discuss the BnF experiment and will focus specifically on four issues: * collection building: Web archives as a new and challenging collection, * resource discovery: access services and tools for end users, * usage: facts and figures, * involvement: strategies to build a librarian community and reach out to end users.
DA - 2010/09/29/ PY - 2010 DO - 10.18352/lq.7987 VL - 20 IS - 2 SP - 179 SN - 2213-056X UR - https://www.liberquarterly.eu/article/10.18352/lq.7987/ KW - web archives KW - collection building KW - France KW - archiving websites KW - end users KW - resource discovery KW - usage ER - TY - JOUR TI - Behind the Scenes of the Global Information Society: Libraries and Big-time Politics AU - Kuzmin, Evgeniy I. T2 - Bibliotekovedenie [Library and Information Science (Russia)] AB - The paper examines the challenges facing libraries in the new information environment. Accessibility and preservation of information, information ethics, promotion of media and information literacy and reading, and the promotion of multilingualism and diversity in cyberspace are a reflection of global problems; by helping to solve them, libraries contribute to the creation of the information society. DA - 2013/04/23/ PY - 2013 DO - 10.25281/0869-608X-2013-0-2-13-18 IS - 2 SP - 13 EP - 18 SN - 2587-7372 UR - http://bibliotekovedenie.rsl.ru/jour/article/view/848 ER - TY - JOUR TI - The Experience of the National Libraries Abroad of the Collection and Longterm Preservation of Internet Resources AU - Brakker, Nadezhda V. AU - Kujbyshev, Leonid A. T2 - Bibliotekovedenie [Library and Information Science (Russia)] AB - A review of national libraries' experience of web harvesting, archiving technologies and legal issues. The paper offers an overview of the experience and experiments of the national libraries of Austria, Germany, China, Lithuania, the Netherlands, New Zealand, Norway, Portugal, the United Kingdom, the USA, Finland, France, the Czech Republic and Sweden. DA - 2013/04/23/ PY - 2013 DO - 10.25281/0869-608X-2013-0-2-88-96 IS - 2 SP - 88 EP - 96 SN - 2587-7372 UR - http://bibliotekovedenie.rsl.ru/jour/article/view/860 ER - TY - JOUR TI - The Digital Documents Harvesting and Processing Tool.
AU - Grimshaw, Jennie T2 - ALISS Quarterly AB - The article presents an overview of the Digital Documents Harvesting and Processing Tool (DDHAPT). Topics discussed include the extension of legal deposit to cover electronic publications under the Legal Deposit Libraries (Non-Print Works) Regulations; the DDHAPT web-based application, an extension of the W3ACT tool used for web archiving by the British Library and the other legal deposit libraries; and how DDHAPT enables a selector to set up a list of unique URLs to be crawled at set intervals. DA - 2015/01// PY - 2015 VL - 10 IS - 2 SP - 6 EP - 8 SN - 17479258 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Uniform Resource Locators KW - Access control of electronic records KW - Electronic records management KW - etc. KW - Legal deposit of books ER - TY - JOUR TI - Archiving in the Age of Digital Conversion: Notes for a Politics of "Remains." AU - Méchoulan, Éric T2 - Substance: A Review of Theory & Literary Criticism AB - The article focuses on archiving in the digital age. The author notes that archiving is caught between the materiality of the means of preserving and communicating documents and the relationships of power and institutions of the past. The archive is a form of social transmission, a process that transforms a text, image or sound into a document, an authorization to endure beyond ephemerality. This article was translated by Roxanne Lapidus. DA - 2011/05// PY - 2011 VL - 40 IS - 2 SP - 92 EP - 104 SN - 00492426 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://www.jstor.org/stable/pdf/41300202.pdf KW - DIGITAL preservation KW - WEB archiving KW - ARCHIVES KW - ELECTRONIC information resources KW - INFORMATION resources KW - LAPIDUS KW - Roxanne ER - TY - JOUR TI - The Interconnected Web: A Paradigm for Managing Digital Preservation.
AU - Brown, Heather T2 - World Digital Libraries AB - Digital preservation management has evolved from an initial emphasis on technological issues to a broader understanding of resourcing and organizational issues. Internationally, the trend has moved to a risk management framework that is common to both digital and physical worlds. There are a number of common 'high level' principles and frameworks that intersect both digital and traditional (physical) preservation, and which in turn provide an opportunity to explore an integrated approach to preserving both digital and physical materials. This paper explores the opportunity for such an integrated approach through the paradigm of an interconnected web. [ABSTRACT FROM AUTHOR] DA - 2013/03// PY - 2013 DO - 10.3233/WDL-120096 VL - 6 IS - 1 SP - 1 SN - 0974567X UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://content.iospress.com/articles/world-digital-libraries-an-international-journal/wdl120096 KW - DIGITAL preservation KW - WEB archiving KW - Digital preservation KW - Preservation KW - DIGITIZATION of library materials KW - Digital preservation policy KW - LIBRARY science research KW - Risk management KW - RISK management in business KW - Training ER - TY - JOUR TI - The Sharc framework for data quality in Web archiving AU - Denev, Dimitar AU - Mazeika, Arturas AU - Spaniol, Marc AU - Weikum, Gerhard T2 - The VLDB Journal DA - 2011/04/02/ PY - 2011 DO - 10.1007/s00778-011-0219-9 VL - 20 IS - 2 SP - 183 EP - 207 SN - 1066-8888 UR - http://link.springer.com/10.1007/s00778-011-0219-9 KW - Web archiving KW - Blur KW - Coherence KW - Crawl strategies KW - Data quality ER - TY - JOUR TI - Formátová analýza sklízených dat v rámci projektu Webarchiv NK ČR. (Czech) AU - Kvasnica, Jaroslav AU - Kreibich, Rudolf T2 - File Format Recognition of Data Harvested by Web Archiving Project of National Library of the Czech Republic.
(English) AB - The National Library of the Czech Republic has just begun to ingest harvested data from its web archiving project into a long-term preservation system. This article is an output of an Institutional Science and Research project aiming to implement a retrospective file format recognition framework for harvested data and to map tools related to file format recognition. Precise knowledge of archived data is the cornerstone for building a long-term preservation strategy. Such analysis may also improve conditions of end-user access. (English) [ABSTRACT FROM AUTHOR] DA - 2013/09// PY - 2013 IS - 2 SP - 1 SN - 18042406 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - DIGITAL preservation KW - WEB archiving KW - WARC KW - web archive KW - ARC KW - archiving KW - Heritrix KW - archivace KW - dlouhodobá ochrana digitálních dokumentů KW - file formats KW - FILE organization (Computer science) KW - long term preservation KW - METADATA harvesting KW - Národní digitální knihovna KW - NARODNI knihovna Ceske republiky KW - National digital library KW - souborové formáty KW - web archiv ER - TY - JOUR TI - New medium, old archives? Exploring archival potential in The Live Art Collection of the UK Web Archive AU - Bartlett, Vanessa T2 - International Journal of Performance Arts and Digital Media AB - This article speculates about the new kinds of historical information that performance scholars may be able to preserve as a result of recent innovations in web archiving. Using The Live Art Collection of the UK Web Archive as its case study, the article draws on influences from oral history, new media theory and the digital humanities. Beginning with an assertion that the Web has a tendency to aggregate existing media forms into one archival location, the article makes the case that online writing is key to web archiving's potential to document new kinds of knowledge about performance and live art.
Subsequently, it points to limitations in the current archival structures of the collection and concludes that further innovation is required in order to maximize the scholarly potential of the material contained within it. Interviews with the team who manage and curate the collection are used throughout to support assertions about the collection's intended use and functions. [ABSTRACT FROM AUTHOR] DA - 2014/01/02/ PY - 2014 DO - 10.1080/14794713.2014.912504 VL - 10 IS - 1 SP - 91 EP - 103 SN - 1479-4713 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://www.tandfonline.com/doi/abs/10.1080/14794713.2014.912504 KW - web archiving KW - DIGITAL libraries KW - WEB archives KW - DOCUMENTATION KW - ART museums KW - DIGITAL humanities KW - internet KW - live art KW - oral history KW - UK Web Archive ER - TY - JOUR TI - Development of the National Library of the Czech Republic 2011–2016: Past, Present and Future AU - Böhm, Tomáš T2 - Alexandria: The Journal of National and International Library and Information Issues AB - The National Library of the Czech Republic, which was founded in 1773 by the Austrian Empress Maria Theresa, is one of the oldest National Libraries in Europe. It has been through various organizational changes incorporating other libraries and institutions. In addition to providing traditional library service, the library is active in such fields as digitization, paper documents restoration and preservation, refurbishment of its main seat in the baroque Klementinum building and international cooperation. The most important digitization project is the creation of the National Digital Library, which will also serve as the LTP (Long Term Preservation) repository for other digitization projects carried out by either the National Library or by other libraries and institutions in the Czech Republic.
Other projects in this field are: the world's biggest digital manuscript library (Manuscriptorium), creation of the Web Archive, digitization of rare books in partnership with Google, formation of the repository for digitized Czech cultural heritage and, together with other main Czech libraries, work on the creation of the Czech Libraries Portal. The Library is further active in paper documents restoration and preservation where it is trying to tackle the problem of de-acidification as well as the formation of the physical Czech Depository Library and the Interdisciplinary Methodological Centre for Book Restoration and Conservation. The Library continues to serve its users during the refurbishment of the Klementinum. It aims to create 'a modern library in baroque walls' by the end of 2018. Furthermore, a new physical depository has been built on the outskirts of Prague. [ABSTRACT FROM AUTHOR] DA - 2014/12// PY - 2014 DO - 10.7227/ALX.0028 VL - 25 IS - 3 SP - 17 EP - 24 SN - 0955-7490 UR - https://doi.org/10.7227/ALX.0028 L4 - http://journals.sagepub.com/doi/10.7227/ALX.0028 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - Web archives KW - digitization KW - Digitization of library materials KW - library refurbishment KW - Narodni knihovna Ceske republiky KW - National Library of the Czech Republic KW - paper documents restoration ER - TY - JOUR TI - Web historiography and Internet Studies: Challenges and perspectives AU - Brügger, Niels T2 - New Media & Society AB - I argue that web historiography should be placed higher on the Internet Studies' research agenda, since a better understanding of the web of the past is an important condition for gaining a more complete understanding of the web of today, regardless of our focus (e.g. political economy, language and culture, social interaction or everyday use).
Building on reflections about 'historiography' and the 'web', I discuss several major challenges of web historiography vis-à-vis historiography in general, focusing on the characteristics of the archived website and the web sphere, and the consequences of these characteristics for web historians. I conclude by outlining future directions for web historiography. [ABSTRACT FROM AUTHOR] DA - 2013/08/21/ PY - 2013 DO - 10.1177/1461444812462852 VL - 15 IS - 5 SP - 752 EP - 764 SN - 1461-4448 UR - https://doi.org/10.1177/1461444812462852 L4 - http://journals.sagepub.com/doi/10.1177/1461444812462852 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archiving KW - RESEARCH KW - web archiving KW - INTERNET KW - Internet KW - web KW - web history KW - HISTORIOGRAPHY KW - LANGUAGE & culture KW - SOCIAL interaction KW - WEB services KW - web sphere KW - website ER - TY - JOUR TI - Practical Digital Preservation: A How-to Guide for Organizations of Any Size (London: Facet, 2013, 336 pp., ISBN 978-1-85604-755-5, £49.95, soft cover) AU - Schellnack-Kelly, Isabel T2 - The Electronic Library DA - 2014/11/03/ PY - 2014 DO - 10.1108/EL-02-2014-0033 VL - 32 IS - 6 SP - 924 EP - 925 SN - 0264-0473 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://www.emeraldinsight.com/doi/10.1108/EL-02-2014-0033 KW - Web archiving KW - Digital preservation KW - Internet KW - Articles KW - Book review KW - Digital infrastructure KW - Information & communications technology KW - Information & knowledge management KW - Librarianship/library management KW - Library & information science KW - Library technology ER - TY - JOUR TI - The state of e-legal deposit in France: Looking back at five years of putting new legislation into practice and envisioning the future.
AU - Stirling, Peter AU - Illien, Gildas AU - Sanz, Pascal AU - Sepetjan, Sophie T2 - IFLA Journal AB - The article describes the legal situation in France regarding the legal deposit of digital material, and shows how it has been implemented in practice at the Bibliothèque nationale de France (BnF). The focus is on web archiving, where the BnF has experience going back almost 10 years, but other aspects of digital legal deposit are discussed, with possible future developments and challenges. Throughout, comparisons are made with the situations in other countries. [ABSTRACT FROM PUBLISHER] DA - 2012/03// PY - 2012 VL - 38 IS - 1 SP - 5 EP - 24 SN - 03400352 UR - https://doi.org/10.1177/0340035211435323 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - etc. KW - Legal deposit of books KW - Bibliothèque nationale de France KW - Copyright -- France KW - digital legal deposit KW - Electronic publication laws KW - France ER - TY - JOUR TI - Archiving in the networked world: preserving plagiarized works AU - Seadle, Michael T2 - Library Hi Tech AB - Purpose – Plagiarism has become a salient issue for universities and thus for university libraries in recent years. This paper aims to discuss three interrelated aspects of preserving plagiarized works: collection development issues, copyright problems, and technological requirements. Too often these three are handled separately even though in fact each has an influence on the other. Design/methodology/approach – The paper looks first at the ingest process (called the Submission Information Package or SIP), then at storage management in the archive (the AIP or Archival Information Package), and finally at the retrieval process (the DIP or Distribution Information Package). Findings – The chief argument of this paper is that works of plagiarism and the evidence exposing them are complex objects, technically, legally and culturally.
Merely treating them like any other work needing preservation runs the risk of encountering problems on one of those three fronts. Practical implications – This is a problem, since currently many public preservation strategies focus on ingesting large amounts of self-contained content that resembles print on paper, rather than on online works that need special handling. Archival systems also often deliberately ignore the cultural issues that affect future usability. Originality/value – The paper discusses special handling and special considerations for archiving works of plagiarism. [ABSTRACT FROM AUTHOR] DA - 2011/11/22/ PY - 2011 DO - 10.1108/07378831111189750 VL - 29 IS - 4 SP - 655 EP - 662 SN - 0737-8831 UR - https://doi.org/10.1108/07378831111189750 L4 - https://www.emeraldinsight.com/doi/10.1108/07378831111189750 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Digital libraries KW - Digital preservation KW - Archiving KW - Preservation KW - Germany KW - Information retrieval KW - Collection development in libraries KW - Collections management KW - Copyright & digital preservation KW - Information resources management KW - Intellectual property KW - Plagiarism ER - TY - JOUR TI - The arcomem Architecture for Social- and Semantic-Driven Web Archiving AU - Risse, Thomas AU - Demidova, Elena AU - Dietze, Stefan AU - Peters, Wim AU - Papailiou, Nikolaos AU - Doka, Katerina AU - Stavrakas, Yannis AU - Plachouras, Vassilis AU - Senellart, Pierre AU - Carpentier, Florent AU - Mantrach, Amin AU - Cautis, Bogdan AU - Siehndel, Patrick AU - Spiliotopoulos, Dimitris T2 - Future Internet, Vol 6, Iss 4, Pp 688-716 (2014) VO - 6 AB - The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages.
Web archives are turning into “community memories” that aim at building a better understanding of the public view on, e.g., celebrities, court decisions and other events. Due to the size of the Web, the traditional “collect-all” strategy is in many cases not the best method to build Web archives. In this paper, we present the ARCOMEM (From Collect-All Archives to Community Memories) architecture and implementation that uses semantic information, such as entities, topics and events, complemented with information from the Social Web to guide a novel Web crawler. The resulting archives are automatically enriched with semantic meta-information to ease the access and allow retrieval based on conditions that involve high-level concepts. DA - 2014/11/04/ PY - 2014 DO - 10.3390/fi6040688 VL - 6 IS - 4 SP - 688 SN - 1999-5903 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://www.mdpi.com/1999-5903/6/4/688/ L4 - http://www.mdpi.com/1999-5903/6/4/688 L4 - http://www.mdpi.com/1999-5903/6/4/688/htm KW - web archiving KW - Mathematics KW - web crawler KW - architecture KW - Electronic computers. Computer science KW - Information technology KW - Instruments and machines KW - Science KW - social Web KW - T58.5-58.64 KW - text analysis ER - TY - JOUR TI - Focused crawler for events. AU - Farag, Mohamed M G AU - Lee, Sunshin AU - Fox, Edward A T2 - International Journal on Digital Libraries AB - There is a need for an Integrated Event Focused Crawling system to collect Web data about key events. When a disaster or other significant event occurs, many users try to locate the most up-to-date information about that event. Yet, there is little systematic collecting and archiving anywhere of event information. We propose intelligent event focused crawling for automatic event tracking and archiving, ultimately leading to effective access.
We developed an event model that can capture key event information, and incorporated that model into a focused crawling algorithm. For the focused crawler to leverage the event model in predicting webpage relevance, we developed a function that measures the similarity between two event representations. We then conducted two series of experiments to evaluate our system on two recent events: the California shooting and the Brussels attack. The first experiment series evaluated the effectiveness of our proposed event model representation when assessing the relevance of webpages. Our event model-based representation outperformed the baseline method (topic-only); it showed better results in precision, recall, and F1-score with an improvement of 20% in F1-score. The second experiment series evaluated the effectiveness of the event model-based focused crawler for collecting relevant webpages from the WWW. Our event model-based focused crawler outperformed the state-of-the-art baseline focused crawler (best-first); it showed better results in harvest ratio with an average improvement of 40%.
[ABSTRACT FROM AUTHOR] DA - 2018/03// PY - 2018 DO - 10.1007/s00799-016-0207-1 VL - 19 IS - 1 SP - 3 EP - 19 LA - English SN - 14325012 UR - https://search.proquest.com/docview/2002183191?accountid=27464 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://doi.org/10.1007/s00799-016-0207-1 KW - Web archiving KW - DIGITAL libraries KW - WORLD Wide Web KW - Digital libraries KW - WEB archives KW - Archiving KW - AUTOMATIC tracking KW - Data analysis KW - Event archiving KW - Event modeling KW - Focused crawling KW - Library And Information Sciences--Computer Applica KW - Representations KW - Shooting KW - WEBSITES KW - World Wide Web ER - TY - RPRT TI - Selecting websites in an encyclopaedic national library: A shared collection policy for BnF internet legal deposit ; La sélection de sites web dans une bibliothèque nationale encyclopédique AU - Bonnel, Sylvie AU - Oury, Clément AB - In just a few years, the web has become one of the main channels of cultural expression and consumption in French society; online publications have joined our national heritage. That heritage is all the more precious because it is fragile. In France, it was decided to place the mission of preserving the internet within the centuries-old tradition of legal deposit. However, adapting this legal and scholarly framework to such a vast and sprawling publication space is far from straightforward. The BnF defines the scope of its collecting through a series of successive restrictions: legal, technical and economic. To ensure the representativeness of its legal deposit, the BnF has also adopted an original archiving model that combines 'broad' crawls of the national domain with more targeted harvests of sites identified by BnF librarians or by partners. The BnF has thus come to apply selection logics within a legal deposit framework. To this end, each department involved in web collecting has developed, through successive experiments, its own collection strategy. The web legal deposit 'correspondents' have adopted approaches that are complementary rather than contradictory: selection versus sampling, continuity of collections versus exploration of new territories. The BnF must now undertake a synthesis of these different policies, as part of the revision of its collection charter and in a context where budgetary constraints call for more clearly asserted priorities. CY - France, Europe DA - 2014/// PY - 2014 PB - HAL CCSD UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - [ SHS.INFO ] Humanities and Social Sciences/Librar KW - archivage web KW - bibliothèques KW - content policy KW - Dépôt légal internet KW - Internet legal deposit KW - libraries KW - politique documentaire ER - TY - JOUR TI - Forget me net, not. AU - HOCKX-Yu, Helen AU - KAHLE, Brewster T2 - Newsweek Global AB - The article discusses web archiving, focussing on a private project, Internet Archive, founded by Brewster Kahle, and the project of the British Library to capture and preserve every web page in the British domain, .co.uk, led by Helen Hockx-Yu. Topics include estimates of the amount of digital data created each year, estimates of the amount of data lost or altered in a year and the evolution of the role of libraries as they branch out to web archiving.
DA - 2014/07/11/ PY - 2014 VL - 163 IS - 2 SP - 1 EP - 6 SN - 00289604 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archiving KW - WEB archives KW - INTERNET Archive (Firm) KW - BRITISH Library KW - Brewster KW - KAHLE KW - Helen KW - HOCKX-Yu ER - TY - RPRT TI - Collection & community building through web archiving: engaging with faculty and students in a collaborative web archiving project AU - Schuler, Andrea AB - Tisch Library at Tufts University has recently begun a pilot web archiving project, aiming to deepen Tufts’ collections in areas of strategic importance and support more “traditional” library collection development activities, while collecting material that is not known to be comprehensively collected by other institutions. Additionally, the project offers an opportunity for collaborative collection building with faculty and students that serves as a unique way to deepen our community‘s engagement with the library. The initial pilot collection focuses on environmental justice, selected due to its relevance to the Tufts community and curriculum and to build on existing Tisch Library collection strengths. Two undergraduate courses related to environmental justice were identified and invited to partner in the pilot project. This partnership would leverage student research to expand the initial collection while introducing students to concepts of web archiving and information literacy around websites and providing them with the opportunity to contribute to shaping the scholarly record. Both courses added a brief assignment to their syllabus: while doing research on their chosen topics, students would identify 3-7 web sites they felt would benefit from preservation and submit the sites to the library, to be evaluated and added to the web archive as appropriate. 
This presentation discusses the process of beginning a subject-based web archiving project, focusing on the collaborative project with two undergraduate classes. It addresses decisions made when starting and scoping the project; collection development issues; the logistics, benefits, and outcomes of the student and faculty collaboration; and future directions. CY - United States, North America DA - 2017/// PY - 2017 PB - Digital USD UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - collaboration KW - outreach KW - digital collections KW - collection development KW - Library and Information Science KW - undergraduates ER - TY - JOUR TI - Tales from The Keepers Registry: Serial Issues About Archiving & the Web AU - Burnhill, Peter T2 - Serials Review AB - Abstract: A key task for libraries is to ensure access for their patrons to the scholarly statements now found across the Internet. Three stories reveal progress towards success in that task. The context of these stories is the shift from print to digital format for all types of continuing resources, particularly journals, and the need to archive not just serials but also ongoing ‘integrating resources’ such as databases and Web sites. The first story is about The Keepers Registry, an international initiative to monitor the extent of e-journal archiving. The second story is about the variety of ‘serial issues’ that have had to be addressed during the PEPRS (Piloting an E-journals Preservation Registry Service) project which was commissioned in the UK by JISC. These include identification, naming and identification of publishers, and the continuing need for a universal holdings statement. The role of the ISSN, and of the ISSN-L, has been a key. The third story looks beyond e-journals to new research objects and the dynamics of the Web, to the role of citation and fixity, and to broader matters of digital preservation. 
This story reflects upon seriality, as the Web becomes the principal arena and medium for scholarly discourse. Scientific discourse is now resident on the Web. Much that is issued on the Web is issued nowhere else: it is a digital native. Statistics that indicate the extent of archiving for e-journals to which major university libraries subscribe are also included in the article. [Copyright Elsevier] DA - 2013/03// PY - 2013 DO - 10.1016/j.serrev.2013.02.003 VL - 39 IS - 1 SP - 3 EP - 20 SN - 00987913 UR - https://doi.org/10.1016/j.serrev.2013.02.003 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://linkinghub.elsevier.com/retrieve/pii/S0098791313000178 KW - WEB archiving KW - Archiving KW - WEBSITES KW - Web KW - Preservation KW - ISSN KW - Citation KW - DATABASES KW - ELECTRONIC information resources KW - ELECTRONIC journals KW - Identifiers KW - LIBRARIES KW - TALE (Literary form) ER - TY - JOUR TI - A UWS Case for 200-Style Memento Negotiations ; Bulletin of IEEE Technical Committee on Digital Libraries AU - Xie, Zhiwu AB - Uninterruptible web service (UWS) is a web archiving application that handles server errors using the most recently archived representation of the requested web resource. The application is developed as an Apache module. It leverages the transactional web archiving tool SiteStory, which archives all previously accessed representations of web resources originating from a website. This application helps to improve the website's quality of service by temporarily masking server errors from the end user and gaining precious time for the system administrator to debug and recover from server failures. By providing value-added support to website operations, we aim to reduce the resistance to transactional web archiving, which in turn may lead to a better coverage of web history.
DA - 2015/// PY - 2015 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Memento KW - Uninterruptible web service ER - TY - JOUR TI - The Importance of Web Archives for Humanities AU - Gomes, Daniel AU - Costa, Miguel T2 - International Journal of Humanities and Arts Computing AB - The web is the primary means of communication in developed societies. It contains descriptions of recent events generated through distinct perspectives. Thus, the web is a valuable resource for contemporary historical research. However, its information is extremely ephemeral. Several research studies have shown that only a small amount of information remains available on the web for longer than one year. Web archiving aims to acquire, preserve and provide access to historical information published online. In April 2013, there were at least sixty-four web archiving initiatives worldwide. Altogether, these archived collections of web documents form a comprehensive picture of our cultural, commercial, scientific and social history. Web archiving also has an important sociological impact because ordinary citizens are publishing personal information online without preservation concerns. In the future, web archives will probably be the only source of personal memories for many people. We provide some examples of tools that facilitate historical research over web archives, highlighting their potential for Humanities.
[ABSTRACT FROM AUTHOR] DA - 2014/04// PY - 2014 DO - 10.3366/ijhac.2014.0122 VL - 8 IS - 1 SP - 106 EP - 123 SN - 1753-8548 UR - https://doi.org/10.3366/ijhac.2014.0122 L4 - http://www.euppublishing.com/doi/abs/10.3366/ijhac.2014.0122 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archiving KW - Digital Preservation KW - Web Archiving KW - WEB archives KW - Digital Humanities KW - DIGITAL humanities KW - HISTORY -- Methodology KW - HISTORY & technology ER - TY - RPRT TI - Nearline Web Archiving AU - Xie, Zhiwu AU - Nayyar, Krati AU - Fox, Edward A. AB - In this paper, we propose a modified approach to real-time transactional web archiving. It leverages the web caching infrastructure that is already prevalent on web servers. Instead of archiving web content at HTTP transaction time, in our approach the archiving happens when the cached copy expires and is about to be expunged. Before the deletion, all expired cache copies are combined and then sent to the web archive in small batches. Since the cache is purged at a much lower frequency than HTTP transactions occur, the archival workload is also much lower than that for transactional archiving. To further decrease the processing load at the origin server, archival copy deduplication is carried out at the archive instead of at the origin server. It is crucial to note that the cache purging process is separate from those that serve the HTTP requests. It can be, and usually is, set to a lower priority. The archiving therefore occurs only when the server is not busy fulfilling its more mission-critical tasks; this is much less disruptive to the origin server. This approach, however, does not guarantee that the freshest copy is archived, although the cache purging policy may be adjusted to attempt to bound the freshness of the archive.
CY - United States, North America DA - 2016/// PY - 2016 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://hdl.handle.net/10919/71648 KW - Web archiving KW - Apache web server KW - Nearline web archiving KW - Web cache ER - TY - JOUR TI - From a System of Journals to a Web of Objects AU - Van de Sompel, Herbert AU - Davis, Susan T2 - The Serials Librarian AB - The article focuses on the web-based research process presented by Herbert Van de Sompel, Prototyping Team Leader at the Research Library of the Los Alamos National Laboratory in New Mexico, in which he explored the transition from a paper-based system to a web-based scholarly communication system. Topics discussed include Van de Sompel's current and ongoing projects, the core functions of the scholarly communication system, and the possibility of long-term access to the scholarly record. DA - 2015/05/19/ PY - 2015 DO - 10.1080/0361526X.2015.1026748 VL - 68 IS - 1-4 SP - 51 EP - 63 SN - 0361-526X UR - https://doi.org/10.1080/0361526X.2015.1026748 L4 - http://www.tandfonline.com/doi/full/10.1080/0361526X.2015.1026748 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - ACCESS to information KW - ARCHIVES KW - WORLD Wide Web KW - link rot KW - scholarly communication KW - LEARNING & scholarship KW - INFORMATION resources management KW - reference rot KW - SERIAL publications KW - web of objects ER - TY - JOUR TI - Archiving the Russian and East European Lesbian, Gay, Bisexual, and Transgender Web, 2013: A Pilot Project AU - Pendse, Liladhar R T2 - Slavic & East European Information Resources AB - This article focuses on the conceptualization and implementation of a web archiving pilot project of selected Russian and East European lesbian, gay, bisexual, and transgender (LGBT) websites by the University of California, Berkeley.
It introduces the use of the Web Archiving Services (WAS) platform developed by the California Digital Library. While identifying the criteria used to harvest these websites, the paper also describes various complexities associated with the viability of projects related to such complex social and political issues as the Russian and Eastern European LGBT rights movements. The article does not take an ideological stance with respect to legal issues, but rather strives to preserve information for academic research. [ABSTRACT FROM AUTHOR] DA - 2014/07/03/ PY - 2014 DO - 10.1080/15228886.2014.930973 VL - 15 IS - 3 SP - 182 EP - 196 SN - 1522-8886 UR - https://doi.org/10.1080/15228886.2014.930973 L4 - http://www.tandfonline.com/doi/abs/10.1080/15228886.2014.930973 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - WEB archives KW - websites KW - Russia KW - INFORMATION storage & retrieval systems KW - Berkeley KW - bisexual and transgender web KW - California Digital Library KW - CATALOGING of archival materials KW - East Europe KW - Eastern Europe KW - gay KW - lesbian KW - LGBT KW - LGBT websites KW - PILOT projects KW - University of California KW - UNIVERSITY research ER - TY - RPRT TI - Nuove prospettive per il web archiving: gli standard ISO 28500 (formato WARC) e ISO/TR 14873 sulla qualità del web archiving AU - Allegrezza, Stefano AB - Web archiving is a highly topical subject: as is well known, unless effective and sustainable long-term solutions are identified soon, we risk losing forever what has been produced and published on the Web over the last twenty to thirty years, since such material is extremely changeable and dynamic, and entire websites often change or disappear within a short time. The solutions proposed to date are partial and have not always achieved their goal.
Recently, however, two developments appear to offer better prospects: on the one hand, the proposal of an electronic format designed specifically for archiving the Web (the WARC format); on the other, the publication of a dedicated ISO standard on quality in Web preservation (ISO/TR 14873:2013). The topic is so relevant to the cultural heritage sector that it is worth clarifying these issues by analysing both the state of the art and future prospects. CY - Italy, Europe DA - 2015/// PY - 2015 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - digital preservation KW - WARC KW - archiviazione del web KW - conservazione digitale ER - TY - JOUR TI - Web Archiving in the UK: Cooperation, Legislation and Regulation AU - Tuck, John T2 - Liber Quarterly: The Journal of European Research Libraries, Vol 18, Iss 3-4, Pp 357-365 (2008) VO - 18 AB - The author presents an overview of web archiving in an international context, focussing on web archiving initiatives in the United Kingdom from 2001 onwards. DA - 2008/// PY - 2008 DO - 10.18352/lq.7935 IS - 3-4 SP - 357 SN - 2213-056X UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://doaj.org/article/761b591fbcef40daa42a34625b27beaa L4 - https://www.liberquarterly.eu/articles/10.18352/lq.7935/ KW - Web archiving KW - legal deposit KW - Bibliography. Library science.
Information resources KW - UKWAC KW - United Kingdom ER - TY - JOUR TI - Building a Living, Breathing Archive: A Review of Appraisal Theories and Approaches for Web Archives AU - Post, Colin T2 - Preservation, Digital Technology & Culture AB - The paper provides a review of published literature on the collection and development of Web archives, focusing specifically on the theories, techniques, tools, and approaches used to appraise Web-based materials for inclusion in collections. Facing an enormous amount of Web-based materials, archival institutions and other cultural heritage institutions need to devise methods to actively select Webpages for preservation, creating Web archives that constitute a cultural record of the Web for the benefit of users. This review outlines the challenges of collecting and appraising Web-based materials, places the theories and activities of collecting Web-based materials within the broader discourse of archival appraisal, and points out directions for future research and critical discourse for Web archives. DA - 2017/// PY - 2017 DO - 10.1515/pdtc-2016-0031 VL - 46 IS - 2 SP - 69 EP - 77 LA - English SN - 21952957 UR - https://search.proquest.com/docview/1940603266?accountid=27464 KW - Web archiving KW - web archiving KW - web archives KW - Archives KW - Library And Information Sciences KW - 3.2:ARCHIVES KW - appraisal KW - Archival appraisal KW - Cultural resources KW - Literature reviews ER - TY - JOUR TI - Challenges of archiving and preserving born-digital news applications AU - Boss, Katherine AU - Broussard, Meredith T2 - IFLA Journal AB - Born-digital news content is increasingly becoming the format of the first draft of history. Archiving and preserving this history is of paramount importance to the future of scholarly research, but many technical, legal, financial, and logistical challenges stand in the way of these efforts.
This is especially true for news applications, or custom-built websites that comprise some of the most sophisticated journalism stories today, such as the "Dollars for Docs" project by ProPublica. Many news applications are standalone pieces of software that query a database, and this significant subset of apps cannot be archived in the same way as text-based news stories, or fully captured by web archiving tools such as Archive-It. As such, they are currently disappearing. This paper will outline the various challenges facing the archiving and preservation of born-digital news applications, as well as suggestions for how to approach this important work. DA - 2017/06// PY - 2017 DO - 10.1177/0340035216686355 VL - 43 IS - 2 SP - 150 EP - 157 LA - English SN - 0340-0352 UR - https://search.proquest.com/docview/1900646766?accountid=27464 L4 - http://journals.sagepub.com/doi/abs/10.1177/0340035216686355 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - web archiving KW - Library And Information Sciences KW - 3.2:ARCHIVES KW - Born-digital news KW - Computer software KW - Journalism KW - news applications KW - News coverage KW - Preservation KW - Scholarly publishing KW - software preservation KW - TCP-IP ER - TY - JOUR TI - Analysing and Enriching Focused Semantic Web Archives for Parliament Applications AU - Demidova, Elena AU - Barbieri, Nicola AU - Dietze, Stefan AU - Funk, Adam AU - Holzmann, Helge AU - Maynard, Diana AU - Papailiou, Nikolaos AU - Peters, Wim AU - Risse, Thomas AU - Spiliotopoulos, Dimitris T2 - Future Internet, Vol 6, Iss 3, Pp 433-456 (2014) VO - 6 AB - The web and the social web play an increasingly important role as an information source for Members of Parliament and their assistants, journalists, political analysts and researchers.
They provide important background information, such as reactions to political events and comments made by the general public. The case study presented in this paper is driven by two European parliaments (the Greek and the Austrian parliament) and targets an effective exploration of political web archives. In this paper, we describe semantic technologies deployed to ease the exploration of the archived web and social web content and present evaluation results. DA - 2014/// PY - 2014 DO - 10.3390/fi6030433 IS - 3 SP - 433 SN - 1999-5903 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archiving KW - Information technology KW - T58.5-58.64 KW - enrichment KW - entity and event extraction KW - parliament libraries KW - semantic content analysis KW - topic detection ER - TY - JOUR TI - Your digital legacy. AU - Paul-Choudhury, Sumit T2 - New Scientist AB - The article discusses how individuals' digital legacies, or the collection of posts from social networking websites, are being stored long term. The article notes that while Internet companies like Google store people's information on servers for research and advertising purposes, some historians feel this kind of digital preservation is not permanent enough, and caution individuals not to trust corporations to save this data. The article notes archive methods are being researched. DA - 2011/04/23/ PY - 2011 VL - 210 IS - 2809 SP - 40 EP - 43 SN - 02624079 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - DIGITAL preservation KW - WEB archiving KW - DATA protection KW - GOOGLE Inc. KW - PRESERVATION of materials ER - TY - JOUR TI - Named entity evolution recognition on the Blogosphere. AU - Holzmann, Helge AU - Tahmasebi, Nina AU - Risse, Thomas T2 - International Journal on Digital Libraries AB - Advancements in technology and culture lead to changes in our language.
These changes create a gap between the language known by users and the language stored in digital archives. This affects users' ability, first, to find content and, second, to interpret that content. In a previous work, we introduced our approach for named entity evolution recognition (NEER) in newspaper collections. Lately, increasing efforts in Web preservation have led to increased availability of Web archives covering longer time spans. However, language on the Web is more dynamic than in traditional media, and many of the basic assumptions from the newspaper domain do not hold for Web data. In this paper we discuss the limitations of existing methodology for NEER. We approach these by adapting an existing NEER method to work on noisy data like the Web, and the Blogosphere in particular. We develop novel filters that reduce the noise and make use of Semantic Web resources to obtain more information about terms. Our evaluation shows the potential of the proposed approach. [ABSTRACT FROM AUTHOR] DA - 2015/04// PY - 2015 VL - 15 IS - 2-4 SP - 209 EP - 235 SN - 14325012 UR - https://doi.org/10.1007/s00799-014-0135-x L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - DIGITAL preservation KW - WEB archiving KW - BLOGS KW - Blogs KW - DBpedia KW - Named entity evolution KW - Semantic Web KW - SEMANTIC Web KW - WEB databases ER - TY - JOUR TI - When should I make preservation copies of myself? AU - Cartledge, Charles AU - Nelson, Michael T2 - International Journal on Digital Libraries AB - We investigate how different replication policies, ranging from least aggressive to most aggressive, affect the level of preservation achieved by autonomic processes used by web objects (WOs). Based on simulations of small-world graphs of WOs created by the Unsupervised Small-World algorithm, we report quantitative and qualitative results for graphs ranging in order from 10 to 5000 WOs.
Our results show that a moderately aggressive replication policy makes the best use of distributed host resources, causing spikes in neither CPU usage nor network activity while meeting preservation goals. We examine different approaches by which WOs can communicate with each other and determine how long it would take for a message from one WO to reach a specific WO, or all WOs. [ABSTRACT FROM AUTHOR] DA - 2015/09// PY - 2015 VL - 16 IS - 3/4 SP - 183 EP - 205 SN - 14325012 UR - https://doi.org/10.1007/s00799-015-0155-1 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - DIGITAL preservation KW - WEB archiving KW - DIGITAL libraries KW - WEB archives KW - Preservation KW - CENTRAL processing units KW - Crowd sourcing KW - INFORMATION storage & retrieval systems KW - Small-world KW - Web object ER - TY - JOUR TI - The Future of Web Citation Practices AU - Davis, Robin Camille T2 - Behavioral & Social Sciences Librarian AB - Citing webpages has been a common practice in scholarly publications for nearly two decades as the Web evolved into a major information source. But over the years, more and more bibliographies have suffered from "reference rot": cited URLs are broken links or point to a page that no longer contains the content the author originally cited. In this column, I look at several studies showing how reference rot has affected different academic disciplines. I also examine citation styles' approaches to citing Web sources. I then turn to emerging Web citation practices: Perma, a "freemium" Web archiving service specifically for citation; and the Internet Archive, the largest Web archive.
DA - 2016/07// PY - 2016 DO - 10.1080/01639269.2016.1241122 VL - 35 IS - 3 SP - 128 EP - 134 LA - English SN - 0163-9269 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://www.tandfonline.com/doi/abs/10.1080/01639269.2016.1241122 L4 - https://search.proquest.com/docview/1845846795?accountid=27464 KW - Web archiving KW - ARCHIVES KW - WORLD Wide Web KW - Digital archives KW - Library And Information Sciences KW - BIBLIOGRAPHICAL citations KW - BIBLIOGRAPHY (Documentation) KW - DATA security KW - LEARNING & scholarship ER - TY - JOUR TI - Collecting Digital Content at the Library of Congress. AU - LOC Library Services Collection Development Office T2 - Digital Publishing Report AB - In January 2017, the Library of Congress adopted a set of strategic steps related to its future acquisition of digital content. The purpose of this document is to provide background information and a high-level description of the strategy. The Library has been steadily increasing its digital collecting capacity and capability over the past two decades. This has come as the product of numerous independent efforts directed toward the same goal: acquire as much selected digital content as technically possible and make that content as broadly accessible to users as possible. In the past few years, much progress has been made, and an impressive amount of content has been acquired through several acquisition methods. Further expansion of the Library's digital collecting program is seen as an essential part of the institution's strategic goal to acquire, preserve, and provide access to a universal collection of knowledge and the record of America's creativity. The scope of the newly adopted strategy is limited to actions directly involved with acquisitions and collecting. It does not cover other related actions that are essential to a successful digital collections program.
These primarily include the following: further development of the Library's technical infrastructure; development of various access policies and procedures appropriate to different categories of digital content; preservation of acquired digital content; training and development of staff; and eventual realignment of resources to match an environment where a greater portion of the Library's collection building program focuses on digital materials. The strategy also does not cover digitization, which is the process by which the Library's physical collections materials (printed text, images, sound on tangible formats, etc.) are converted into digital formats that can be stored and accessed via a computer. DA - 2017/03/20/ PY - 2017 VL - 5 IS - 11 SP - 2 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L1 - https://www.loc.gov/acq/devpol/CollectingDigitalContent.pdf L1 - https://www.loc.gov/acq/devpol/CollectingDigitalContent.pdf?loclr=blogsig KW - WEB archiving KW - DATA transmission systems KW - LIBRARIES & publishing KW - LIBRARY acquisitions KW - LIBRARY of Congress ER - TY - JOUR TI - A Method for Identifying Personalized Representations in Web Archives AU - Kelly, Mat AU - Brunelle, Justin F AU - Weigle, Michele C AU - Nelson, Michael L T2 - D-Lib Magazine AB - Web resources are becoming increasingly personalized: two different users clicking on the same link at the same time can see content customized for each individual user. These changes result in multiple representations of a resource that cannot be canonicalized in Web archives. We identify characteristics of this problem by presenting a potential solution to generalize personalized representations in archives.
We also present our proof-of-concept prototype that analyzes WARC (Web ARChive) format files, inserts metadata establishing relationships, and provides archive users the ability to navigate on the additional dimension of environment variables in a modified Wayback Machine. Adapted from the source document. DA - 2013/11// PY - 2013 DO - 10.1045/november2013-kelly VL - 19 IS - 11-12 LA - English SN - 1082-9873 UR - https://search.proquest.com/docview/1622284455?accountid=27464 L4 - http://www.dlib.org/dlib/november13/kelly/11kelly.html KW - Web archiving KW - Web sites KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Customization KW - Methods ER - TY - JOUR TI - Internet Archive, Reed Tech Agree AU - Duke, Judy T2 - Advanced Technology Libraries AB - Internet Archive and Reed Technology and Information Services Inc., part of the LexisNexis family, have agreed to jointly market and sell Internet Archive's Archive-It service and continue to support the growing community of organizations currently using the service. First launched at Internet Archive in early 2006, Archive-It has been providing a sophisticated and flexible solution to a broad range of organizations and institutions focused on creating and managing collections of Web content. Adapted from the source document.
DA - 2013/12// PY - 2013 VL - 42 IS - 12 SP - 6 EP - 7 LA - English SN - 0044-636X UR - https://search.proquest.com/docview/1622279345?accountid=27464 KW - Collaboration KW - Web archiving KW - Marketing KW - article KW - 13.1: INFORMATION STORAGE AND RETRIEVAL - ECONOMIC KW - Information industry ER - TY - JOUR TI - Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations AU - Zittrain, Jonathan AU - Albert, Kendra AU - Lessig, Lawrence T2 - Legal Information Management AB - It has become increasingly common for a reader to follow a URL cited in a court opinion or a law review article, only to be met with an error message because the resource has been moved from its original online address. This form of reference rot, commonly referred to as 'linkrot', has arisen from the disconnect between the transience of online materials and the permanence of legal citation, and will only become more prevalent as scholarly materials move online. The present paper, written by Jonathan Zittrain, Kendra Albert and Lawrence Lessig, explores the pervasiveness of linkrot in academic and legal citations, finding that more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information. In light of these results, a solution is proposed for authors and editors of new scholarship that involves libraries undertaking the distributed, long-term preservation of link contents.
[PUBLICATION ABSTRACT] DA - 2014/06// PY - 2014 DO - 10.1017/S1472669614000255 VL - 14 IS - 2 SP - 88 EP - 99 LA - English SN - 14726696 UR - https://search.proquest.com/docview/1535097054?accountid=27464 KW - web archiving KW - link rot KW - Library And Information Sciences KW - websites KW - legal citations ER - TY - JOUR TI - Gathering the 'Net: Efforts and Challenges in Archiving Pacific Websites AU - Kleiber, Eleanor T2 - The Contemporary Pacific AB - In addition to more traditional material -- books, journals and other serial publications, brochures, music, films, manuscripts, photographs, postcards and archives -- the University of Hawai'i-Manoa (UHM) Library's Hawaiian and Pacific Collections are now actively collecting websites. With so many new websites being created in and about the Pacific Islands region, and so much more information being made available online -- and at times exclusively so -- it has become increasingly clear to the librarians of these collections that to adequately document this period in history it is necessary to collect and preserve websites. The UHM Library has been attempting to archive websites in one form or another since 2001. This essay will discuss the importance of collecting Pacific websites, describe how the Hawaiian and Pacific Collections are finding solutions for the inherent challenges of preserving websites, and explore some potential future directions that would strengthen the project and meet the information and research needs of the Pacific Islands region. Adapted from the source document.
DA - 2014/// PY - 2014 DO - 10.1353/cp.2014.0017 VL - 26 IS - 1 SP - 158 EP - 166 LA - English SN - 1043-898X UR - https://search.proquest.com/docview/1629324578?accountid=27464 KW - Web archiving KW - Web sites KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article KW - Pacific Region KW - University libraries ER - TY - JOUR TI - Understanding Service Quality and System Quality Success Factors in Cloud Archiving from an End-User Perspective AU - Burda, Daniel AU - Teuteberg, Frank T2 - Information Systems Management AB - This study seeks to explain the adoption of cloud storage services as a means of personal archiving, focusing on users' service and system quality perceptions and their drivers. The authors derive and empirically validate a model that incorporates users' perceptions of service/system quality as well as behavioral factors to explain usage. Finally, the authors highlight important determinants of system/service quality perceptions that cloud providers should pay attention to in their attempts to increase market share.
DA - 2015/// PY - 2015 DO - 10.1080/10580530.2015.1079998 VL - 32 IS - 4 SP - 266 EP - 284 LA - English SN - 1058-0530 UR - https://search.proquest.com/docview/1784145700?accountid=27464 KW - Web archiving KW - Library And Information Sciences--Computer Applica KW - 13.11:INFORMATION STORAGE AND RETRIEVAL - NETWORKS KW - adoption KW - cloud archiving KW - Cloud computing KW - cloud storage KW - Customer satisfaction KW - Quality of service KW - service quality KW - system quality KW - technology acceptance model KW - Users ER - TY - JOUR TI - Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs AU - Dougherty, Meghan AU - Meyer, Eric T T2 - Journal of the Association for Information Science and Technology AB - The web encourages the constant creation and distribution of large amounts of information; it is also a valuable resource for understanding human behavior and communication. To take full advantage of the web as a research resource that extends beyond the consideration of snapshots of the present, however, it is necessary to begin to take web archiving much more seriously as an important element of any research program involving web resources. The ephemeral character of the web requires that researchers take proactive steps in the present to enable future analysis. Efforts to archive the web or portions thereof have been developed around the world, but these efforts have not yet provided reliable and scalable solutions. This article summarizes the current state of web archiving in relation to researchers and research needs. Interviews with researchers, archivists, and technologists identify the differences in purpose, scope, and scale of current web archiving practice, and the professional tensions that arise given these differences.
Findings outline the challenges that still face researchers who wish to engage seriously with web content as an object of research, and archivists who must strike a balance reflecting a range of user needs. [Copyright Wiley Periodicals Inc.] DA - 2014/11// PY - 2014 DO - 10.1002/asi.23099 VL - 65 IS - 11 SP - 2195 EP - 2209 LA - English SN - 2330-1635 UR - https://search.proquest.com/docview/1700661485?accountid=27464 KW - Web archiving KW - Digital preservation KW - Research KW - 9.15: TECHNICAL SERVICES - PRESERVATION KW - article ER - TY - JOUR TI - InZeit: Efficiently Identifying Insightful Time Points AU - Setty, Vinay AU - Bedathur, Srikanta AU - Berberich, Klaus AU - Weikum, Gerhard T2 - Proc. VLDB Endow. AB - Web archives are useful resources to find out about the temporal evolution of persons, organizations, products, or other topics. However, even when advanced text search functionality is available, gaining insights into the temporal evolution of a topic can be a tedious task and often requires sifting through many documents. The demonstrated system, named InZeit (pronounced "insight"), assists users by determining insightful time points for a given query. These are the time points at which the top-k time-travel query result changes substantially and for which the user should therefore inspect query results. InZeit determines the m most insightful time points efficiently using an extended segment tree for in-memory bookkeeping. DA - 2010/// PY - 2010 DO - 10.14778/1920841.1921050 VL - 3 IS - 1-2 SP - 1605 EP - 1608 SN - 2150-8097 UR - https://doi.org/10.14778/1920841.1921050 ER - TY - CONF TI - Determining Users' Motivations to Participate in Online Community Archives: A Preliminary Study of Documenting Ferguson AU - Freeland, Chris AU - Atiso, Kodjo AB - The shooting death of teenager Michael Brown in Ferguson, Missouri, spurred an immediate national and international response in the fall of 2014.
Washington University Libraries in St. Louis, Missouri, established the Documenting Ferguson web archive to gather digital media documenting local protests and demonstrations as captured by community members, in order to archive the materials for future research and scholarly use. This preliminary study identified the factors that motivated participants to contribute content to the Documenting Ferguson online community archive, uncovering themes of altruism, reciprocity, and personal development. C1 - Silver Spring, MD, USA C3 - Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community DA - 2015/// PY - 2015 SP - 106:1 EP - 106:4 PB - American Society for Information Science SN - 0-87715-547-X UR - http://dl.acm.org/citation.cfm?id=2857070.2857176 KW - human-computer interaction KW - motivation KW - participatory archives ER - TY - CONF TI - What's Really New on the Web?: Identifying New Pages from a Series of Unstable Web Snapshots AU - Toyoda, Masashi AU - Kitsuregawa, Masaru C1 - New York, NY, USA C3 - Proceedings of the 15th International Conference on World Wide Web DA - 2006/// PY - 2006 DO - 10.1145/1135777.1135815 SP - 233 EP - 241 PB - ACM SN - 1-59593-323-9 UR - http://doi.acm.org/10.1145/1135777.1135815 KW - information retrieval KW - link analysis KW - novelty KW - web evolution ER - TY - CONF TI - A System for Visualizing and Analyzing the Evolution of the Web with a Time Series of Graphs AU - Toyoda, Masashi AU - Kitsuregawa, Masaru AB - We propose WebRelievo, a system for visualizing and analyzing the evolution of the web structure based on a large Web archive with a series of snapshots. It visualizes the evolution with a time series of graphs, in which nodes are web pages and edges are relationships between pages. Graphs can be clustered to show an overview of changes in the graphs.
WebRelievo aligns these graphs according to their time, and automatically determines their layout, keeping the positions of nodes synchronized over time so that the user can keep track of pages and clusters. This visualization enables us to understand when pages appeared, how their relationships have evolved, and how clusters are merged and split over time. The current implementation of WebRelievo is based on six Japanese web archives crawled from 1999 to 2003. The user can interactively browse those graphs by changing the focused page and by changing the layouts of graphs. Using WebRelievo we can answer historical questions and investigate changes in trends on the Web. We show the feasibility of WebRelievo by applying it to tracking trends in P2P systems and search engines for mobile phones, and to investigating link spamming. C1 - New York, NY, USA C3 - Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia DA - 2005/// PY - 2005 DO - 10.1145/1083356.1083387 SP - 151 EP - 160 PB - ACM SN - 1-59593-168-6 UR - http://doi.acm.org/10.1145/1083356.1083387 KW - visualization KW - link analysis KW - evolution KW - link spamming KW - Web graph ER - TY - CONF TI - FluxCapacitor: Efficient Time-travel Text Search AU - Berberich, Klaus AU - Bedathur, Srikanta AU - Neumann, Thomas AU - Weikum, Gerhard AB - An increasing number of temporally versioned text collections are available today, with Web archives being a prime example. Search on such collections, however, is often not satisfactory and ignores their temporal dimension completely. Time-travel text search solves this problem by evaluating a keyword query on the state of the text collection as of a user-specified time point. This work demonstrates our approach to efficient time-travel text search and its implementation in the FLUXCAPACITOR prototype.
C3 - Proceedings of the 33rd International Conference on Very Large Data Bases DA - 2007/// PY - 2007 SP - 1414 EP - 1417 PB - VLDB Endowment SN - 978-1-59593-649-3 UR - http://dl.acm.org/citation.cfm?id=1325851.1326029 ER - TY - CONF TI - What Happens when Facebook is Gone? AU - McCown, Frank AU - Nelson, Michael L AB - Web users are spending more of their time and creative energies within online social networking systems. While many of these networks allow users to export their personal data or expose themselves to third-party web archiving, some do not. Facebook, one of the most popular social networking websites, is one example of a "walled garden" where users' activities are trapped. We examine a variety of techniques for extracting users' activities from Facebook (and by extension, other social networking systems) for the personal archive and for the third-party archiver. Our framework could be applied to any walled garden where personal user data is being locked. C1 - New York, NY, USA C3 - Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2009/// PY - 2009 DO - 10.1145/1555400.1555440 SP - 251 EP - 254 PB - ACM SN - 978-1-60558-322-8 UR - http://doi.acm.org/10.1145/1555400.1555440 KW - digital preservation KW - personal archiving KW - social networks ER - TY - CONF TI - Building Entity-centric Event Collections AU - Nanni, Federico AU - Ponzetto, Simone Paolo AU - Dietz, Laura AB - Web archives preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric sub-collections from such large archives, which includes not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). 
This is achieved by 1) identifying relevant concepts and entities from a knowledge base, and 2) detecting their mentions in documents, which are interpreted as indicators for relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record, and we test its performance on the TREC KBA Stream corpus, a large and publicly available web archive. C1 - Piscataway, NJ, USA C3 - Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries DA - 2017/// PY - 2017 SP - 199 EP - 208 PB - IEEE Press SN - 978-1-5386-3861-3 UR - http://dl.acm.org/citation.cfm?id=3200334.3200356 ER - TY - CONF TI - Surfing Notes: An Integrated Web Annotation and Archiving Tool AU - He, Sisi AU - Chan, Edward AB - Web archiving, which preserves valuable online information that would otherwise disappear due to the dynamic nature of the World Wide Web, and web annotation, which promotes the development of the Web as a two-way information-sharing platform, are both active research fields. However, in spite of their common benefits to information management and intelligent learning, few attempts have been made to integrate web archiving and web annotation. This paper introduces Surfing Notes, a cloud-based system which allows users to annotate and archive web pages for personal use. The change detection algorithm as well as the change detection interval scheduler are discussed in detail and evaluated experimentally. 
C1 - Washington, DC, USA C3 - Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 03 DA - 2012/// PY - 2012 DO - 10.1109/WI-IAT.2012.174 SP - 301 EP - 305 PB - IEEE Computer Society SN - 978-0-7695-4880-7 UR - http://dx.doi.org/10.1109/WI-IAT.2012.174 L4 - https://dl.acm.org/citation.cfm?id=2457555 KW - information retrieval KW - e-learning tool KW - web searching ER - TY - CONF TI - Exploring the Past of the Web: Alexandria & Archive-it Hackathon AU - Anand, Avishek AU - Bailey, Jefferson AB - The Web has pervaded all walks of life and has become an important corpus for studying the humanities, social sciences, and for use by computer scientists and other disciplines. Web archives collect, preserve, and provide ongoing access to ephemeral Web pages and hence encode traces of human thought, activity, and history. This makes them a valuable resource for analysis and study. However, there have been only a few concerted efforts to bring together tools, platforms, storage, processing frameworks, and existing collections for mining and analysing Web archives. C1 - New York, NY, USA C3 - Proceedings of the 8th ACM Conference on Web Science DA - 2016/// PY - 2016 DO - 10.1145/2908131.2908212 SP - 14 PB - ACM SN - 978-1-4503-4208-7 UR - http://doi.acm.org/10.1145/2908131.2908212 ER - TY - CONF TI - Tracking Entities in Web Archives: The LAWA Project AU - Spaniol, Marc AU - Weikum, Gerhard AB - Web-preservation organizations like the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers like sociologists, political scientists, media and market analysts, or experts on intellectual property. 
The LAWA project (Longitudinal Analytics of Web Archive data) is developing an Internet-based experimental testbed for large-scale data analytics on Web archive collections. Its emphasis is on scalable methods for this specific kind of big-data analytics, and software tools for aggregating, querying, mining, and analyzing Web contents over long epochs. In this paper, we highlight our research on entity-level analytics in Web archive data, which lifts Web analytics from plain text to the entity-level by detecting named entities, resolving ambiguous names, extracting temporal facts and visualizing entities over extended time periods. Our results provide key assets for tracking named entities in the evolving Web, news, and social media. C1 - New York, NY, USA C3 - Proceedings of the 21st International Conference on World Wide Web DA - 2012/// PY - 2012 DO - 10.1145/2187980.2188030 SP - 287 EP - 290 PB - ACM SN - 978-1-4503-1230-1 UR - http://doi.acm.org/10.1145/2187980.2188030 KW - entity analytics KW - fire KW - temporal web analytics ER - TY - CONF TI - A Study of Automation from Seed URL Generation to Focused Web Archive Development: The CTRnet Context AU - Yang, Seungwon AU - Chitturi, Kiran AU - Wilson, Gregory AU - Magdy, Mohamed AU - Fox, Edward A AB - In the event of emergencies and disasters, massive amounts of web resources are generated and shared. Due to the rapidly changing nature of those resources, it is important to start archiving them as soon as a disaster occurs. This led us to develop a prototype system for constructing archives with minimum human intervention using the seed URLs extracted from tweet collections. We present the details of our prototype system. We applied it to five tweet collections that had been developed in advance, for evaluation. We also identify five categories of non-relevant files and conclude with a discussion of findings from the evaluation. 
C1 - New York, NY, USA C3 - Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2012/// PY - 2012 DO - 10.1145/2232817.2232881 SP - 341 EP - 342 PB - ACM SN - 978-1-4503-1154-0 UR - http://doi.acm.org/10.1145/2232817.2232881 KW - crawling KW - archiving KW - digital library KW - crisis tragedy and recovery network KW - seed URL generation KW - tweet ER - TY - CONF TI - Intelligent Crawling of Web Applications for Web Archiving AU - Faheem, Muhammad AB - The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently accessed (which leads to suboptimal crawling strategies) and whatever structured content is contained in Web pages (which results in page-level archives whose content is hard to exploit). We focus in this PhD work on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web, accessible by Web browsers. Examples include Web forums, social networks, geolocation services, etc. We claim that the best strategy to crawl these applications is to make the Web crawler aware of the kind of application currently processed, allowing it to refine the list of URLs to process, and to annotate the archive with information about the structure of crawled content. We add adaptive characteristics to an archival Web crawler: being able to identify when a Web page belongs to a given Web application and applying the appropriate crawling and content extraction methodology. 
C1 - New York, NY, USA C3 - Proceedings of the 21st International Conference on World Wide Web DA - 2012/// PY - 2012 DO - 10.1145/2187980.2187996 SP - 127 EP - 132 PB - ACM SN - 978-1-4503-1230-1 UR - http://doi.acm.org/10.1145/2187980.2187996 KW - crawling KW - archiving KW - web application KW - extraction KW - xpath ER - TY - CONF TI - Extracting Evolution of Web Communities from a Series of Web Archives AU - Toyoda, Masashi AU - Kitsuregawa, Masaru AB - Recent advances in storage technology make it possible to store a series of large Web archives. It is now an exciting challenge for us to observe evolution of the Web. In this paper, we propose a method for observing evolution of web communities. A web community is a set of web pages created by individuals or associations with a common interest on a topic. So far, various link analysis techniques have been developed to extract web communities. We analyze evolution of web communities by comparing four Japanese web archives crawled from 1999 to 2002. Statistics of these archives and community evolution are examined, and the global behavior of evolution is described. Several metrics are introduced to measure the degree of web community evolution, such as growth rate, novelty, and stability. We developed a system for extracting detailed evolution of communities using these metrics. It allows us to understand when and how communities emerged and evolved. Some evolution examples are shown using our system. 
C1 - New York, NY, USA C3 - Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia DA - 2003/// PY - 2003 DO - 10.1145/900051.900059 SP - 28 EP - 37 PB - ACM SN - 1-58113-704-4 UR - http://doi.acm.org/10.1145/900051.900059 KW - web KW - link analysis KW - evolution KW - web community ER - TY - CONF TI - Hiberlink: Towards Time Travel for the Scholarly Web AU - Sanderson, Robert AU - de Sompel, Herbert AU - Burnhill, Peter AU - Grover, Claire AB - The preservation of traditional, digital scholarly output, such as PDF or HTML journal articles, is relatively well understood, and adequately organized through systems such as Portico and LOCKSS. However, the scholarly record is expanding with a wide variety of materials for which no established archival approaches exist. This includes, for example, workflows and software, project descriptions, demonstrations, datasets, and videos published on the web. Some of these resources are referenced in traditional papers and the lack of archival infrastructure yields a scholarly record with many loose ends. The Hiberlink project aims to quantify the extent to which such referenced resources are preserved in web archives, and propose solutions to ensure the longevity of the context of the research, alongside the formal publication. The Hiberlink project regards the problem of preserving web resources referenced in scholarly papers as a special case of the more general problem of preserving scholarly compound objects, aka Research Objects, which consist of resources with a variety of relationships and dependencies. 
C1 - New York, NY, USA C3 - Proceedings of the 1st International Workshop on Digital Preservation of Research Methods and Artefacts DA - 2013/// PY - 2013 DO - 10.1145/2499583.2500370 SP - 21 PB - ACM SN - 978-1-4503-2185-3 UR - http://doi.acm.org/10.1145/2499583.2500370 KW - memento KW - preservation KW - web KW - research objects KW - repositories ER - TY - CONF TI - Ranking Archived Documents for Structured Queries on Semantic Layers AU - Fafalios, Pavlos AU - Kasturia, Vaibhav AU - Nejdl, Wolfgang AB - Archived collections of documents (like newspaper and web archives) serve as important information sources in a variety of disciplines, including Digital Humanities, Historical Science, and Journalism. However, the absence of efficient and meaningful exploration methods still remains a major hurdle in the way of turning them into usable sources of information. A semantic layer is an RDF graph that describes metadata and semantic information about a collection of archived documents, which in turn can be queried through a semantic query language (SPARQL). This allows running advanced queries by combining metadata of the documents (like publication date) and content-based semantic information (like entities mentioned in the documents). However, the results returned by such structured queries can be numerous and moreover they all equally match the query. In this paper, we deal with this problem and formalize the task of ranking archived documents for structured queries on semantic layers. Then, we propose two ranking models for the problem at hand which jointly consider: i) the relativeness of documents to entities, ii) the timeliness of documents, and iii) the temporal relations among the entities. The experimental results on a new evaluation dataset show the effectiveness of the proposed models and allow us to understand their limitations. 
C1 - New York, NY, USA C3 - Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries DA - 2018/// PY - 2018 DO - 10.1145/3197026.3197049 SP - 155 EP - 164 PB - ACM SN - 978-1-4503-5178-2 UR - http://doi.acm.org/10.1145/3197026.3197049 KW - archived documents KW - probabilistic modeling KW - ranking KW - semantic layers KW - stochastic modeling ER - TY - CONF TI - Towards a Sustainable Crowdsourced Sound Heritage Archive by Public Participation: The Soundsslike Project AU - Yelmi, Pınar AU - Kuşcu, Hüseyin AU - Yantaç, Asım Evren AB - This paper explains how a user-centered design approach shapes a cultural heritage project in the sustainability context. The project aims to protect urban sounds as intangible cultural heritage elements and turn the action of protecting sounds into a collaborative work. Sounds are of great significance in daily urban life and in culture as they carry emotions and awaken cultural memories. Thus, they deserve to be protected and transferred to next generations. In this paper, we first evaluate soundscapes as an intangible cultural heritage element, second we explore the presentation techniques in soundscape studies in the literature, then we explain how the methods were implemented step by step, and finally we introduce the two outcomes: the library archive (The Soundscape of Istanbul project) and the crowdsourced web archive (The Soundsslike project). The Soundscape of Istanbul project aims to collect and archive cultural and urban sounds of the city while The Soundsslike project is basically a crowdsourced online sound archive which invites people to record symbolic urban sounds and upload them to the online sound archive. 
This online platform was built and displayed in an exhibition by means of an interactive tabletop interface to learn more from users and contributors, and to enrich the archive content by raising public awareness of urban sounds. C1 - New York, NY, USA C3 - Proceedings of the 9th Nordic Conference on Human-Computer Interaction DA - 2016/// PY - 2016 DO - 10.1145/2971485.2971492 SP - 71:1 EP - 71:9 PB - ACM SN - 978-1-4503-4763-1 UR - http://doi.acm.org/10.1145/2971485.2971492 KW - Cultural heritage data KW - Design thinking KW - Digital culture KW - Human heritage interaction KW - Open archive KW - Participatory culture KW - Social networks & communities in cultural heritage KW - Sound archive visualization KW - Sustainability ER - TY - CONF TI - SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections AU - Theobald, Martin AU - Siddharth, Jonathan AU - Paepcke, Andreas AB - Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. 
Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative "Gold Set" of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection. C1 - New York, NY, USA C3 - Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval DA - 2008/// PY - 2008 DO - 10.1145/1390334.1390431 SP - 563 EP - 570 PB - ACM SN - 978-1-60558-164-4 UR - http://doi.acm.org/10.1145/1390334.1390431 KW - high-dimensional similarity search KW - inverted index pruning KW - optimal partitioning KW - stopword signatures ER - TY - CONF TI - EventSearch: A System for Event Discovery and Retrieval on Multi-type Historical Data AU - Shan, Dongdong AU - Zhao, Wayne Xin AU - Chen, Rishan AU - Shu, Baihan AU - Wang, Ziqi AU - Yao, Junjie AU - Yan, Hongfei AU - Li, Xiaoming AB - We present EventSearch, a system for event extraction and retrieval on four types of news-related historical data, i.e., Web news articles, newspapers, TV news program, and micro-blog short messages. The system incorporates over 11 million web pages extracted from "Web InfoMall", the Chinese Web Archive since 2001. The newspaper and TV news video clips also span from 2001 to 2011. The system, upon a user query, returns a list of event snippets from multiple data sources. A novel burst model is used to discover events from time-stamped texts. In addition to offline event extraction, our system also provides online event extraction to further meet the user needs. EventSearch provides meaningful analytics that synthesize an accurate description of events. Users interact with the system by ranking the identified events using different criteria (scale, recency and relevance) and submitting their own information needs in different input fields. 
C1 - New York, NY, USA C3 - Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining DA - 2012/// PY - 2012 DO - 10.1145/2339530.2339781 SP - 1564 EP - 1567 PB - ACM SN - 978-1-4503-1462-6 UR - http://doi.acm.org/10.1145/2339530.2339781 KW - event detection KW - event search ER - TY - CONF TI - Carbon Dating the Web: Estimating the Age of Web Resources AU - SalahEldeen, Hany M AU - Nelson, Michael L AB - In the course of web research it is often necessary to estimate the creation datetime for web resources (in the general case, this value can only be estimated). While it is feasible to manually establish likely datetime values for small numbers of resources, this becomes infeasible if the collection is large. We present "carbon date", a simple web application that estimates the creation date for a URI by polling a number of sources of evidence and returning a machine-readable structure with their respective values. To establish a likely datetime, we poll bitly for the first time someone shortened the URI, topsy for the first time someone tweeted the URI, a Memento aggregator for the first time it appeared in a public web archive, Google's time of last crawl, and the Last-Modified HTTP response header of the resource itself. We also examine the backlinks of the URI as reported by Google and apply the same techniques for the resources that link to the URI. We evaluated our tool on a gold standard data set of 1200 URIs in which the creation date was manually verified. We were able to estimate a creation date for 75.90% of the resources, with 32.78% having the correct value. Given the different nature of the URIs, the union of the various methods produces the best results. While the Google last crawl date and topsy account for nearly 66% of the closest answers, eliminating the web archives or Last-Modified from the results produces the largest overall negative impact on the results. 
The carbon date application is available for download or use via a web API. C1 - New York, NY, USA C3 - Proceedings of the 22nd International Conference on World Wide Web DA - 2013/// PY - 2013 DO - 10.1145/2487788.2488121 SP - 1075 EP - 1082 PB - ACM SN - 978-1-4503-2038-2 UR - http://doi.acm.org/10.1145/2487788.2488121 KW - social media KW - memento KW - archiving KW - creation dates ER - TY - CONF TI - Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets AU - Barik, Titus AU - Lubick, Kevin AU - Smith, Justin AU - Slankas, John AU - Murphy-Hill, Emerson AB - Spreadsheets are perhaps the most ubiquitous form of end-user programming software. This paper describes a corpus, called Fuse, containing 2,127,284 URLs that return spreadsheets (and their HTTP server responses), and 249,376 unique spreadsheets, contained within a public web archive of over 26.83 billion pages. Obtained using nearly 60,000 hours of computation, the resulting corpus exhibits several useful properties over prior spreadsheet corpora, including reproducibility and extendability. Our corpus is unencumbered by any license agreements, available to all, and intended for wide usage by end-user software engineering researchers. In this paper, we detail the data and the spreadsheet extraction process, describe the data schema, and discuss the trade-offs of Fuse with other corpora. C1 - Piscataway, NJ, USA C3 - Proceedings of the 12th Working Conference on Mining Software Repositories DA - 2015/// PY - 2015 SP - 486 EP - 489 PB - IEEE Press SN - 978-0-7695-5594-2 UR - http://dl.acm.org/citation.cfm?id=2820518.2820594 ER - TY - CONF TI - What can history tell us? AU - Jatowt, Adam AU - Kawai, Yukiko AU - Ohshima, Hiroaki AU - Tanaka, Katsumi AB - The current Web is a dynamic collection where little effort is made to version pages or to enable users to access historical data. As a consequence, they generally do not have sufficient temporal support when browsing the Web. 
However, we think that there are many benefits to be obtained from integrating documents with their histories. For example, a document's history can enable us to travel back through time to establish its trustworthiness. This paper discusses the possible types of interactions that users could have with document histories and it presents several examples of systems that we have implemented for utilizing this historical data. To support our view, we present the results of an online survey conducted with the objective of investigating user needs for temporal support on the Web. Although the results indicated quite low use of Web archives by users, they simultaneously emphasized their considerable interest in page histories. C1 - New York, New York, USA C3 - Proceedings of the nineteenth ACM conference on Hypertext and hypermedia - HT '08 DA - 2008/// PY - 2008 DO - 10.1145/1379092.1379098 SP - 5 PB - ACM Press SN - 978-1-59593-985-2 UR - http://doi.acm.org/10.1145/1379092.1379098 L4 - http://portal.acm.org/citation.cfm?doid=1379092.1379098 KW - archiving KW - document history KW - past web KW - time travel KW - versioning ER - TY - CONF TI - Cross-lingual Web Spam Classification AU - Garzó, András AU - Daróczy, Bálint AU - Kiss, Tamás AU - Siklósi, Dávid AU - Benczúr, András A AB - While Web spam training data exists in English, we face an expensive human labeling procedure if we want to filter a Web domain in a different language. In this paper we overview how existing content and link based classification techniques work, how models can be "translated" from English into another language, and how language-dependent and independent methods combine. In particular we show that simple bag-of-words translation works very well and in this procedure we may also rely on mixed language Web hosts, i.e. those that contain an English translation of part of the local language text. 
Our experiments are conducted on the ClueWeb09 corpus as the training English collection and a large Portuguese crawl of the Portuguese Web Archive. To foster further research, we provide labels and precomputed values of term frequencies, content and link based features for both ClueWeb09 and the Portuguese data. C1 - New York, NY, USA C3 - Proceedings of the 22nd International Conference on World Wide Web DA - 2013/// PY - 2013 DO - 10.1145/2487788.2488139 SP - 1149 EP - 1156 PB - ACM SN - 978-1-4503-2038-2 UR - http://doi.acm.org/10.1145/2487788.2488139 KW - content analysis KW - cross-lingual text processing KW - link analysis KW - web classification KW - web spam ER - TY - CONF TI - Towards a Peer2Peer World-wide-web for the Broadband-enabled User Community AU - Mantratzis, Constantine AU - Orgun, Mehmet AB - This paper aims to study the concept of a distributed World Wide Web archive that complements the existing WWW and "lives" across a vast Peer-to-Peer network of broadband-connected user nodes. It proposes the sharing of a web browser's cached data with other peers in an effort to provide an alternative resource to "discontinued" web documents with [normally] short life spans such as video and audio content as well as frequently restructured text pages. We have based this study on the success of existing file-sharing Peer-to-Peer networks and aim to extend their use further to facilitate content-oriented usage more appropriately while at the same time, addressing some of the major problems that arise from this. 
C1 - New York, NY, USA C3 - Proceedings of the 2004 ACM Workshop on Next-generation Residential Broadband Challenges DA - 2004/// PY - 2004 DO - 10.1145/1026763.1026772 SP - 42 EP - 49 PB - ACM SN - 1-58113-935-7 UR - http://doi.acm.org/10.1145/1026763.1026772 KW - distributed world wide web KW - peer 2 peer ER - TY - CONF TI - Retrieving Broken Web Links Using an Approach Based on Contextual Information AU - Martinez-Romo, Juan AU - Araujo, Lourdes AB - In this short note we present a recommendation system for automatic retrieval of broken Web links using an approach based on contextual information. We extract information from the context of a link such as the anchor text, the content of the page containing the link, and a combination of the cached page in a search engine and web archive, if it exists. Then the selected information is processed and submitted to a search engine. We propose an algorithm based on information retrieval techniques to select the most relevant information and to rank the candidate pages provided by the search engine, in order to help the user to find the best replacement. To test the different methods, we have also defined a methodology which does not require user judgments, which increases the objectivity of the results. 
Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and are able to share their knowledge with other Opal servers by mutual harvesting using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Using cached copies that can be found on the web, Opal creates lexical signatures which are then used to search for similar versions of the web page. We present the architecture of the Opal framework, discuss a reference implementation of the framework, and present a quantitative analysis of the framework that indicates that Opal could be effectively deployed. C1 - New York, NY, USA C3 - Proceedings of the Seventeenth Conference on Hypertext and Hypermedia DA - 2006/// PY - 2006 DO - 10.1145/1149941.1149971 SP - 145 EP - 156 PB - ACM SN - 1-59593-417-0 UR - http://doi.acm.org/10.1145/1149941.1149971 KW - digital preservation KW - 404 web pages KW - apache web server ER - TY - CONF TI - An Efficient Clustering Algorithm for Large-scale Topical Web Pages AU - Wang, Lei AU - Chen, Peng AU - Huang, Lian'en AB - The clustering of topic-related web pages has been recognized as foundational work in exploiting large sets of web pages, as in search engines and web archive systems, which collect and preserve billions of web pages. However, this task faces great challenges both in efficiency and accuracy. In this paper we present a novel clustering algorithm for large-scale topical web pages which achieves high efficiency together with considerably high accuracy. In our algorithm, a two-phase divide and conquer framework is developed to solve the efficiency problem, in which both link analysis and content analysis are utilized in mining the topical similarity between pages to achieve a high accuracy. 
A comprehensive experiment was conducted to evaluate our method in terms of its effectiveness, efficiency, and quality of result. C1 - New York, NY, USA C3 - Proceedings of the 18th ACM Conference on Information and Knowledge Management DA - 2009/// PY - 2009 DO - 10.1145/1645953.1646247 SP - 1851 EP - 1854 PB - ACM SN - 978-1-60558-512-3 UR - http://doi.acm.org/10.1145/1645953.1646247 KW - content analysis KW - link analysis KW - clustering KW - topic model KW - topical similarity ER - TY - JOUR TI - A History of an Internet Exchange Point AU - Cardona Restrepo, Juan Camilo AU - Stanojevic, Rade T2 - SIGCOMM Comput. Commun. Rev. AB - In spite of the tremendous amount of measurement efforts on understanding the Internet as a global system, little is known about the 'local' Internet (among ISPs inside a region or a country) due to limitations of the existing measurement tools and scarce data. In this paper, empirical in nature, we characterize the evolution of one such ecosystem of local ISPs by studying the interactions between ISPs happening at the Slovak Internet eXchange (SIX). By crawling the web archive waybackmachine.org we collect 158 snapshots (spanning 14 years) of the SIX website, with the relevant data that allows us to study the dynamics of the Slovak ISPs in terms of: the local ISP peering, the traffic distribution, the port capacity/utilization and the local AS-level traffic matrix. Examining our data revealed a number of invariant and dynamic properties of the studied ecosystem that we report in detail. DA - 2012/// PY - 2012 DO - 10.1145/2185376.2185384 VL - 42 IS - 2 SP - 58 EP - 64 SN - 0146-4833 UR - http://doi.acm.org/10.1145/2185376.2185384 KW - internet exchange KW - internet traffic KW - peering KW - traffic matrix ER - TY - CONF TI - Optimizing Positional Index Structures for Versioned Document Collections AU - He, JInru AU - Suel, Torsten AB - Versioned document collections are collections that contain multiple versions of each document. 
Important examples are Web archives, Wikipedia and other wikis, or source code and documents maintained in revision control systems. Versioned document collections can become very large, due to the need to retain past versions, but there is also a lot of redundancy between versions that can be exploited. Thus, versioned document collections are usually stored using special differential (delta) compression techniques, and a number of researchers have recently studied how to exploit this redundancy to obtain more succinct full-text index structures. In this paper, we study index organization and compression techniques for such versioned full-text index structures. In particular, we focus on the case of positional index structures, while most previous work has focused on the non-positional case. Building on earlier work in [zs:redun], we propose a framework for indexing and querying in versioned document collections that integrates non-positional and positional indexes to enable fast top-k query processing. Within this framework, we define and study the problem of minimizing positional index size through optimal substring partitioning. Experiments on Wikipedia and web archive data show that our techniques achieve significant reductions in index size over previous work while supporting very fast query processing. C1 - New York, NY, USA C3 - Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval DA - 2012/// PY - 2012 DO - 10.1145/2348283.2348319 SP - 245 EP - 254 PB - ACM SN - 978-1-4503-1472-5 UR - http://doi.acm.org/10.1145/2348283.2348319 KW - index compression KW - inverted index KW - versioned documents ER - TY - CONF TI - Global Web Archive Integration with Memento AU - Sanderson, Robert AB - In this poster, we describe the approach taken to designing and implementing a tera-scale multi-repository index of archived web resources using massively parallel processing. 
C1 - New York, NY, USA C3 - Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2012/// PY - 2012 DO - 10.1145/2232817.2232900 SP - 379 EP - 380 PB - ACM SN - 978-1-4503-1154-0 UR - http://doi.acm.org/10.1145/2232817.2232900 KW - digital preservation KW - memento KW - high performance computing ER - TY - CONF TI - Web Not for All: A Large Scale Study of Web Accessibility AU - Lopes, Rui AU - Gomes, Daniel AU - Carriço, Luís AB - The Web accessibility discipline strives for the study and improvement of front-end Web design towards people with disabilities. Best practices such as WCAG dictate how Web pages should be created accordingly. On top of WCAG, several evaluation procedures enable the measurement of the quality level of a Web page. We leverage these procedures in an automated evaluation of a nearly 30 million Web page collection provided by the Portuguese Web Archive. Our study shows that there is high variability regarding the accessibility level of Web pages, and that few pages reach high accessibility levels. The obtained results show that there is a correlation between accessibility and complexity (i.e., number of HTML elements) of a Web page. We have also verified the effect of the interpretation of evaluation warnings towards the perception of accessibility.
C1 - New York, NY, USA C3 - Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A) DA - 2010/// PY - 2010 DO - 10.1145/1805986.1806001 SP - 10:1 EP - 10:4 PB - ACM SN - 978-1-4503-0045-2 UR - http://doi.acm.org/10.1145/1805986.1806001 KW - automated evaluation KW - quality assessment KW - web accessibility KW - web characterisation KW - web science ER - TY - CONF TI - Mapping the UK Webspace: Fifteen Years of British Universities on the Web AU - Hale, Scott A AU - Yasseri, Taha AU - Cowls, Josh AU - Meyer, Eric T AU - Schroeder, Ralph AU - Margetts, Helen AB - This paper maps the national UK web presence on the basis of an analysis of the .uk domain from 1996 to 2010. It reviews previous attempts to use web archives to understand national web domains and describes the dataset. Next, it presents an analysis of the .uk domain, including the overall number of links in the archive and changes in the link density of different second-level domains over time. We then explore changes over time within a particular second-level domain, the academic subdomain .ac.uk, and compare linking practices with variables, including institutional affiliation, league table ranking, and geographic location. We do not detect institutional affiliation affecting linking practices and find only partial evidence of league table ranking affecting network centrality, but find a clear inverse relationship between the density of links and the geographical distance between universities. This echoes prior findings regarding offline academic activity, which allows us to argue that real-world factors like geography continue to shape academic relationships even in the Internet age. We conclude with directions for future uses of web archive resources in this emerging area of research. 
C1 - New York, NY, USA C3 - Proceedings of the 2014 ACM Conference on Web Science DA - 2014/// PY - 2014 DO - 10.1145/2615569.2615691 SP - 62 EP - 70 PB - ACM SN - 978-1-4503-2622-3 UR - http://doi.acm.org/10.1145/2615569.2615691 KW - big data KW - web archives KW - world wide web KW - academic web KW - hyperlink analysis KW - network analysis ER - TY - CONF TI - The Past Issue of the Web AU - Hockx-Yu, Helen AB - This paper takes a critical look at the efforts since the mid-1990s in archiving and preserving websites by memory institutions around the world. It contains an overview of the approaches and practices to date, and a discussion of the various technical, curatorial and legal issues related to web archiving. It also looks at a number of current projects which take a different approach to dealing with the temporal aspects or persistence of the web. The paper argues for closer collaboration with the mainstream web science research community and the use of technology developed for the live web, such as visualisation and data analytics, to advance the web archiving agenda. C1 - New York, NY, USA C3 - Proceedings of the 3rd International Web Science Conference DA - 2011/// PY - 2011 DO - 10.1145/2527031.2527050 SP - 12:1 EP - 12:8 PB - ACM SN - 978-1-4503-0855-7 UR - http://doi.acm.org/10.1145/2527031.2527050 KW - web archiving KW - digital preservation KW - web archive KW - academic research and the web KW - digital libraries KW - electronic legal deposit KW - heritage KW - library information management KW - web harvesting ER - TY - CONF TI - Sprint Methods for Web Archive Research AU - Huurdeman, Hugo C AU - Ben-David, Anat AU - Sammar, Thaer AB - Web archives provide access to snapshots of the Web of the past, and could be valuable for research purposes. However, access to these archives is often limited, both in terms of data availability and interfaces to this data. This paper explores new methods to overcome these limitations.
It presents "sprint-methods" for performing research using an archived collection of the Dutch news aggregator website Nu.nl, and for developing and adapting a search system and interface to this data. The work aims to contribute to research in the humanities and social sciences, in particular New Media research employing digital methods to study the Web of the past. Secondly, this work aims to contribute to Computer Science through the development of novel access tools for Web archives that facilitate research. C1 - New York, NY, USA C3 - Proceedings of the 5th Annual ACM Web Science Conference DA - 2013/// PY - 2013 DO - 10.1145/2464464.2464513 SP - 182 EP - 190 PB - ACM SN - 978-1-4503-1889-1 UR - http://doi.acm.org/10.1145/2464464.2464513 KW - web archives KW - information retrieval KW - digital methods KW - news analysis KW - search interface KW - temporal analysis KW - web collections KW - web history ER - TY - CONF TI - From Web Archive to WebDigest: Concept and Examples AU - Xiaoming, Li AU - Lianen, Huang AB - Much like a black hole, the Web, since its birth, has been absorbing all sorts of data (information) around the globe, ever generated along the path of human civilization. On the other hand, the digitized and networked (webbed) nature of web data, which generally means "easy to access", gives rise to much imagination on re-discovering, re-engineering, and re-using the oceanic information. Nevertheless, there is no free lunch. At the same time as we see the grand opportunities, tremendous challenges lie ahead. In this talk, I'll first introduce Web InfoMall (http://www.infomall.cn), the Chinese web archive we have been constructing since 2001. Along with these activities, we observe that some useful capabilities have been developed, such as large-scale web crawling and very large-scale data organization. In addition, we discuss a step beyond the WebArchive, called WebDigest, which is an effort aimed at making use of the data in the web archive.
With a web archive and associated capability, "web mining" here has a more or less different meaning, which spans from the structure analysis of the web to named entity and relation extraction, from spatial (if we consider URL as a space) information discovery to temporal information exhibition. The main challenge for us is around the theme of achieving reasonably good performance at affordable cost. As we are from a university lab, the underlying question is: what can be done (and how) in a university lab environment with modest resources. After all, much research has started in university labs. We need to understand the feasibilities and compromises while seeing the promises. C1 - Darlinghurst, Australia C3 - Proceedings of the Nineteenth Conference on Australasian Database - Volume 75 DA - 2007/// PY - 2007 SP - 11 PB - Australian Computer Society, Inc. SN - 978-1-920682-56-9 UR - http://dl.acm.org/citation.cfm?id=1378307.1378313 ER - TY - CONF TI - Visualizing Historical Content of Web Pages AU - Jatowt, Adam AU - Kawai, Yukiko AU - Tanaka, Katsumi C1 - New York, NY, USA C3 - Proceedings of the 17th International Conference on World Wide Web DA - 2008/// PY - 2008 DO - 10.1145/1367497.1367736 SP - 1221 EP - 1222 PB - ACM SN - 978-1-60558-085-2 UR - http://doi.acm.org/10.1145/1367497.1367736 KW - web archive KW - past web KW - history summarization KW - page history visualization ER - TY - CONF TI - Managing Duplicates in a Web Archive AU - Gomes, Daniel AU - Santos, André L AU - Silva, Mário J AB - Crawlers harvest the web by iteratively downloading documents referenced by URLs. It is frequent to find different URLs that refer to the same document, leading crawlers to download duplicates. Hence, web archives built through incremental crawls waste space storing these documents. In this paper, we study the existence of duplicates within a web archive and discuss strategies to eliminate them at the storage level during the crawl.
We present a storage system architecture that addresses the requirements of web archives and detail its implementation and evaluation. The system now supports an archive for the Portuguese web, replacing previous NFS-based storage servers. Experimental results showed that the elimination of duplicates can improve storage throughput. The web storage system outperformed NFS-based storage by 68% in read operations and by 50% in write operations. C1 - New York, NY, USA C3 - Proceedings of the 2006 ACM Symposium on Applied Computing DA - 2006/// PY - 2006 DO - 10.1145/1141277.1141465 SP - 818 EP - 825 PB - ACM SN - 1-59593-108-2 UR - http://doi.acm.org/10.1145/1141277.1141465 ER - TY - CONF TI - Prizm: A Wireless Access Point for Proxy-Based Web Lifelogging AU - Lin, Jimmy AU - Tu, Zhucheng AU - Rose, Michael AU - White, Patrick AB - We present Prizm, a prototype lifelogging device that comprehensively records a user's web activity. Prizm is a wireless access point deployed on a Raspberry Pi that is designed to be a substitute for the user's normal wireless access point. Prizm proxies all HTTP(S) requests from devices connected to it and records all activity it observes. Although this particular design is not entirely novel, there are a few features that are unique to our approach, most notably the physical deployment as a wireless access point. Such a package allows capture of activity from multiple devices, integration with web archiving for preservation, and support for offline operation. This paper describes the design of Prizm, the current status of our project, and future plans.
C1 - New York, NY, USA C3 - Proceedings of the First Workshop on Lifelogging Tools and Applications DA - 2016/// PY - 2016 DO - 10.1145/2983576.2983581 SP - 19 EP - 25 PB - ACM SN - 978-1-4503-4517-0 UR - http://doi.acm.org/10.1145/2983576.2983581 KW - web archiving KW - lifelogging KW - raspberry pi KW - wireless access point ER - TY - CONF TI - An Evaluation of Caching Policies for Memento Timemaps AU - Brunelle, Justin F AU - Nelson, Michael L AB - As defined by the Memento Framework, TimeMaps are machine-readable lists of time-specific copies -- called "mementos" -- of an archived original resource. In theory, as an archive acquires additional mementos over time, a TimeMap should be monotonically increasing. However, there are reasons why the number of mementos in a TimeMap would decrease, for example: archival redaction of some or all of the mementos, archival restructuring, and transient errors of one or more archives. We study TimeMaps for 4,000 original resources over a three-month period, note their change patterns, and develop a caching algorithm for TimeMaps suitable for a reverse proxy in front of a Memento aggregator. We show that TimeMap cardinality is constant or monotonically increasing for 80.2% of all TimeMap downloads in the observation period. The goal of the caching algorithm is to exploit the ideally monotonically increasing nature of TimeMaps and not cache responses with fewer mementos than the already cached TimeMap. This new caching algorithm uses conditional cache replacement and a Time To Live (TTL) value to ensure the user has access to the most complete TimeMap available. Based on our empirical data, a TTL of 15 days will minimize the number of mementos missed by users, and minimize the load on archives contributing to TimeMaps.
C1 - New York, NY, USA C3 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2013/// PY - 2013 DO - 10.1145/2467696.2467717 SP - 267 EP - 276 PB - ACM SN - 978-1-4503-2077-1 UR - http://doi.acm.org/10.1145/2467696.2467717 KW - web archiving KW - digital preservation KW - memento KW - http KW - timemaps KW - web architecture ER - TY - CONF TI - Rank Synopses for Efficient Time Travel on the Web Graph AU - Berberich, Klaus AU - Bedathur, Srikanta AU - Weikum, Gerhard C1 - New York, NY, USA C3 - Proceedings of the 15th ACM International Conference on Information and Knowledge Management DA - 2006/// PY - 2006 DO - 10.1145/1183614.1183769 SP - 864 EP - 865 PB - ACM SN - 1-59593-433-2 UR - http://doi.acm.org/10.1145/1183614.1183769 KW - web graph KW - pagerank KW - web archive search KW - web dynamics ER - TY - CONF TI - Tools and techniques for harvesting the world wide web AU - Marill, J L AU - Boyko, A AU - Ashenfelder, M AU - Graham, L AB - Recently the Library of Congress began developing a strategy for the preservation of digital content. Efforts have focused on the need to select, harvest, describe, access and preserve Web resources. This poster focuses on the Library's initial investigation and evaluation of Web harvesting software tools. C1 - New York, New York, USA C3 - Proceedings of the 2004 joint ACM/IEEE conference on Digital libraries - JCDL '04 DA - 2004/// PY - 2004 DO - 10.1145/996350.996469 SP - 403 PB - ACM Press SN - 1-58113-832-6 UR - http://doi.acm.org/10.1145/996350.996469 L4 - http://portal.acm.org/citation.cfm?doid=996350.996469 KW - web archiving KW - digital preservation KW - web harvesting KW - harvesting tools ER - TY - CONF TI - Exploring Web Archives Through Temporal Anchor Texts AU - Holzmann, Helge AU - Nejdl, Wolfgang AU - Anand, Avishek AB - Web archives have been instrumental in digital preservation of the Web and provide great opportunity for the study of the societal past and evolution. 
These Web archives are massive collections, typically on the order of terabytes and petabytes. Because of this, search and exploration of archives have been limited, as full-text indexing is expensive in both resources and computation. We observe that typical access methods to archives, which are navigational and temporal in nature, do not always require full-text indexing. Instead, meaningful text surrogates like anchor texts already go a long way in providing meaningful solutions and can act as reasonable entry points to exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, like finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated into a data processing workflow to scale up to thousands of pages. In this analysis we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.
C1 - New York, NY, USA C3 - Proceedings of the 2017 ACM on Web Science Conference DA - 2017/// PY - 2017 DO - 10.1145/3091478.3091500 SP - 289 EP - 298 PB - ACM SN - 978-1-4503-4896-6 UR - http://doi.acm.org/10.1145/3091478.3091500 KW - web archives KW - big data analysis KW - temporal information retrieval ER - TY - CONF TI - Client-side Reconstruction of Composite Mementos Using Serviceworker AU - Alam, Sawood AU - Kelly, Mat AU - Weigle, Michele C AU - Nelson, Michael L AB - We use the ServiceWorker (SW) API to intercept HTTP requests for embedded resources and reconstruct Composite Mementos without the need for conventional URL rewriting typically performed by web archives. URL rewriting is a problem for archival replay systems, especially for URLs constructed by JavaScript, that frequently results in incorrect URI references. By intercepting requests on the client using SW, we are able to strategically reroute instead of rewrite. Our implementation moves rewriting to clients, saving servers' computing resources and allowing servers to return responses more quickly. In our experiments, retrieving the original instead of rewritten pages from the archive resulted in a one-third reduction in time overhead and a one-fifth reduction in data overhead. Our system, reconstructive.js, prevents the live web from leaking into Composite Mementos while being easy to distribute and maintain.
C1 - Piscataway, NJ, USA C3 - Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries DA - 2017/// PY - 2017 SP - 237 EP - 240 PB - IEEE Press SN - 978-1-5386-3861-3 UR - http://dl.acm.org/citation.cfm?id=3200334.3200361 KW - memento KW - archival replay KW - composite memento KW - serviceworker KW - web archive ER - TY - CONF TI - Life Span of Web Pages: A Survey of 10 Million Pages Collected in 2001 AU - Agata, Teru AU - Miyata, Yosuke AU - Ishita, Emi AU - Ikeuchi, Atsushi AU - Ueda, Shuichi AB - Identifying and tracking new information on the Web is important in sociology, marketing, and survey research, since new trends might be apparent in the new information. Such changes can be observed by crawling the Web periodically. In practice, however, it is impossible to crawl the entire expanding Web repeatedly. This means that the novelty of a page remains unknown, even if that page did not exist in previous snapshots. In this paper, we propose a novelty measure for estimating the certainty that a newly crawled page appeared between the previous and current crawls. Using this novelty measure, new pages can be extracted from a series of unstable snapshots for further analysis and mining to identify new trends on the Web. We evaluated the precision, recall, and miss rate of the novelty measure using our Japanese web archive, and applied it to a Web archive search engine. 
C1 - Piscataway, NJ, USA C3 - Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2014/// PY - 2014 SP - 463 EP - 464 PB - IEEE Press SN - 978-1-4799-5569-5 UR - http://dl.acm.org/citation.cfm?id=2740769.2740869 KW - web archiving KW - digital preservation KW - internet archive KW - web page life span ER - TY - CONF TI - InterPlanetary Wayback: The Permanent Web Archive AU - Alam, Sawood AU - Kelly, Mat AU - Nelson, Michael L AB - To facilitate permanence and collaboration in web archives, we built InterPlanetary Wayback to disseminate the contents of WARC files into the IPFS network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. We split the header and payload of WARC response records before disseminating into IPFS to leverage the deduplication, build a CDXJ index, and combine them at the time of replay. From a 1.0 GB sample Archive-It collection of WARCs containing 21,994 mementos, we found that on average, 570 files can be indexed and disseminated into IPFS per minute. We also found that in our naive prototype implementation, replay took on average 370 milliseconds per request. C1 - New York, NY, USA C3 - Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries DA - 2016/// PY - 2016 DO - 10.1145/2910896.2925467 SP - 273 EP - 274 PB - ACM SN - 978-1-4503-4229-2 UR - http://doi.acm.org/10.1145/2910896.2925467 KW - web archives KW - memento KW - interplanetary wayback KW - ipfs KW - ipwb KW - p2p file system ER - TY - CONF TI - Arcomem: From Collect-all ARchives to COmmunity MEMories AU - Risse, Thomas AU - Peters, Wim AB - The ARCOMEM project is about memory institutions like archives, museums and libraries in the age of the Social Web. Social media are becoming more and more pervasive in all areas of life.
ARCOMEM's aim is to help to transform archives into collective memories that are more tightly integrated with their community of users and to exploit Web 2.0 and the wisdom of crowds to make Web archiving a more selective and meaning-based process. ARCOMEM (FP7-IST-270239) is an Integrating Project in the FP7 program of the European Commission, which involves twelve partners from academia, industry and public sector. The project will run from January 1, 2011 to December 31, 2013. C1 - New York, NY, USA C3 - Proceedings of the 21st International Conference on World Wide Web DA - 2012/// PY - 2012 DO - 10.1145/2187980.2188027 SP - 275 EP - 278 PB - ACM SN - 978-1-4503-1230-1 UR - http://doi.acm.org/10.1145/2187980.2188027 KW - web archiving KW - web crawler KW - architecture KW - text analysis KW - social web ER - TY - CONF TI - Desiderata for Exploratory Search Interfaces to Web Archives in Support of Scholarly Activities AU - Jackson, Andrew AU - Lin, Jimmy AU - Milligan, Ian AU - Ruest, Nick AB - Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. In this paper, we describe initial experiences in providing an exploratory search interface to web archives for humanities scholars and social scientists. We describe our initial implementation and discuss our findings in terms of desiderata for such a system. It is clear that the standard organization of a search engine results page (SERP), consisting of an ordered list of hits, is inadequate to support the needs of scholars. Shneiderman's mantra for visual information seeking ("overview first, zoom and filter, then details-on-demand") provides a nice organizing principle for interface design, to which we propose an addendum: "Make everything transparent". We elaborate on this by highlighting the importance of the temporal dimension of web pages as well as issues surrounding metadata and veracity. 
C1 - New York, NY, USA C3 - Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries DA - 2016/// PY - 2016 DO - 10.1145/2910896.2910912 SP - 103 EP - 106 PB - ACM SN - 978-1-4503-4229-2 UR - http://doi.acm.org/10.1145/2910896.2910912 KW - metadata KW - faceted browsing KW - shneiderman's mantra KW - veracity ER - TY - CONF TI - Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving AU - Lin, Jimmy AB - Warcbase is an open-source platform for storing, managing, and analyzing web archives using modern "big data" infrastructure on commodity clusters---specifically, HBase for storage and Hadoop for data analytics. This paper describes an effort to scale "down" Warcbase onto a Raspberry Pi, an inexpensive single-board computer about the size of a deck of playing cards. Apart from an interesting technology demonstration, such a design presents new opportunities for personal web archiving, in enabling a low-cost, low-power, portable device that is able to continuously capture a user's web browsing history---not only the URLs of the pages that a user has visited, but the contents of those pages---and allowing the user to revisit any previously-encountered page, as it appeared at that time. Experiments show that data ingestion throughput and temporal browsing latency are adequate with existing hardware, which means that such capabilities are already feasible today. 
C1 - New York, NY, USA C3 - Proceedings of the 24th International Conference on World Wide Web DA - 2015/// PY - 2015 DO - 10.1145/2740908.2741695 SP - 1351 EP - 1355 PB - ACM SN - 978-1-4503-3473-0 UR - http://doi.acm.org/10.1145/2740908.2741695 KW - raspberry pi KW - hadoop KW - hbase ER - TY - CONF TI - Mining Relevant Time for Query Subtopics in Web Archives AU - Nguyen, Tu Ngoc AU - Kanhabua, Nattiya AU - Nejdl, Wolfgang AU - Niederée, Claudia AB - With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold-mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence of such processes is of great benefit for expert users such as journalists, economists, historians, etc. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled all over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, the time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found effective in solving the problem.
C1 - New York, NY, USA C3 - Proceedings of the 24th International Conference on World Wide Web DA - 2015/// PY - 2015 DO - 10.1145/2740908.2741702 SP - 1357 EP - 1362 PB - ACM SN - 978-1-4503-3473-0 UR - http://doi.acm.org/10.1145/2740908.2741702 KW - temporal ranking KW - anchor text mining KW - result diversification KW - temporal subtopic ER - TY - JOUR TI - Report on the Workshop on Web Archiving and Digital Libraries (WADL 2013) AU - Fox, Edward A AU - Farag, Mohamed M T2 - SIGIR Forum AB - This workshop explored the integration of Web archiving and digital libraries, so the complete life cycle involved is covered, from creation/authoring, uploading/publishing in the Web (including Web 2.0), (focused) crawling, curation, indexing, exploration (including searching and browsing), (text) analysis, archiving, and up through long-term preservation. It included particular coverage of current topics of interest: challenges facing archiving initiatives, archiving related to disasters, interaction with and use of archive data, applications on an international scale, working with big data, mobile Web archiving, temporal issues, Memento, and SiteStory. DA - 2013/// PY - 2013 DO - 10.1145/2568388.2568408 VL - 47 IS - 2 SP - 128 EP - 133 SN - 0163-5840 UR - http://doi.acm.org/10.1145/2568388.2568408 ER - TY - CONF TI - Infrastructure for Supporting Exploration and Discovery in Web Archives AU - Lin, Jimmy AU - Gholami, Milad AU - Rao, Jinfeng AB - Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, unlocking the potential of web archives requires tools that support exploration and discovery of captured content. These tools need to be scalable and responsive, and to this end we believe that modern "big data" infrastructure can provide a solid foundation. We present Warcbase, an open-source platform for managing web archives built on the distributed datastore HBase. 
Our system provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing. Relying on HBase for storage infrastructure simplifies the development of scalable and responsive applications. We describe a service that provides temporal browsing and an interactive visualization based on topic models that allows users to explore archived content. C1 - New York, NY, USA C3 - Proceedings of the 23rd International Conference on World Wide Web DA - 2014/// PY - 2014 DO - 10.1145/2567948.2579045 SP - 851 EP - 856 PB - ACM SN - 978-1-4503-2745-9 UR - http://doi.acm.org/10.1145/2567948.2579045 KW - HBase KW - Hadoop ER - TY - JOUR TI - NEAR-Miner: Mining Evolution Associations of Web Site Directories for Efficient Maintenance of Web Archives AU - Chen, Ling AU - Bhowmick, Sourav S AU - Nejdl, Wolfgang T2 - Proc. VLDB Endow. AB - Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, (re)downloading the entire collection of pages periodically from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading Web pages that are located in hierarchical directory structures which are believed to have changed significantly (e.g., a substantial percentage of pages are inserted into/removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl as they can be easily retrieved from the archive.
In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of Web directory structures of the original Web site stored in the archive and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that optimally skips subdirectories (during the next crawl) which are negatively correlated with their ancestors in undergoing significant changes. Our experimental results with real data show that our approach improves the efficiency of the archive maintenance process significantly while sacrificing slightly in keeping the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently as the mined rules can be utilized effectively for archive maintenance over multiple versions. DA - 2009/// PY - 2009 DO - 10.14778/1687627.1687757 VL - 2 IS - 1 SP - 1150 EP - 1161 SN - 2150-8097 UR - http://dx.doi.org/10.14778/1687627.1687757 ER - TY - CONF TI - How Much of the Web is Archived? AU - Ainsworth, Scott G AU - Alsum, Ahmed AU - SalahEldeen, Hany AU - Weigle, Michele C AU - Nelson, Michael L AB - The Memento Project's archive access additions to HTTP have enabled development of new web archive access user interfaces. After experiencing this web time travel, the inevitable question that comes to mind is "How much of the Web is archived?" This question is studied by approximating the Web via sampling URIs from DMOZ, Delicious, Bitly, and search engine indexes and measuring the number of archive copies available in various public web archives. The results indicate that 35%-90% of URIs have at least one archived copy, 17%-49% have two to five copies, 1%-8% have six to ten copies, and 8%-63% have at least ten copies.
The number of URI copies varies as a function of time, but only 14.6-31.3% of URIs are archived more than once per month. C1 - New York, NY, USA C3 - Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries DA - 2011/// PY - 2011 DO - 10.1145/1998076.1998100 SP - 133 EP - 136 PB - ACM SN - 978-1-4503-0744-4 UR - http://doi.acm.org/10.1145/1998076.1998100 KW - web archiving KW - digital preservation KW - HTTP KW - web architecture KW - resource versioning KW - temporal applications ER - TY - CONF TI - Structural and Visual Comparisons for Web Page Archiving AU - Law, Marc Teva AU - Thome, Nicolas AU - Gançarski, Stéphane AU - Cord, Matthieu AB - In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source codes of Web pages, with computer vision techniques. To detect whether successive versions of a Web page are similar or not, our system is based on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real archives validate our approach. 
C1 - New York, NY, USA C3 - Proceedings of the 2012 ACM Symposium on Document Engineering DA - 2012/// PY - 2012 DO - 10.1145/2361354.2361380 SP - 117 EP - 120 PB - ACM SN - 978-1-4503-1116-8 UR - http://doi.acm.org/10.1145/2361354.2361380 KW - web archiving KW - digital preservation KW - change detection algorithms KW - pattern recognition KW - support vector machines ER - TY - CONF TI - Compact Full-text Indexing of Versioned Document Collections AU - He, Jinru AU - Yan, Hao AU - Suel, Torsten AB - We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples of such collections are Wikipedia or the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, such that index size scales linearly with the number of versions. However, several authors have recently studied approaches that exploit the significant similarities between different versions of the same document to obtain much smaller index sizes. In this paper, we propose new techniques for organizing and compressing inverted index structures for such collections. We also perform a detailed experimental comparison of new techniques and the existing techniques in the literature. Our results on an archive of the English version of Wikipedia, and on a subset of the Internet Archive collection, show significant benefits over previous approaches. 
C1 - New York, NY, USA C3 - Proceedings of the 18th ACM Conference on Information and Knowledge Management DA - 2009/// PY - 2009 DO - 10.1145/1645953.1646008 SP - 415 EP - 424 PB - ACM SN - 978-1-60558-512-3 UR - http://doi.acm.org/10.1145/1645953.1646008 KW - web archives KW - wikipedia KW - inverted index KW - versioned documents KW - inverted index compression KW - search engines ER - TY - CONF TI - Digital Libraries and Engines of Search: New Information Systems in the Context of the Digital Preservation AU - Campos, Ricardo AB - The first library projects, based on digitization, appeared some years ago, but the first web archive initiatives started only in 1996. These initiatives were grounded in the growth of the Internet and its increasing use, which proved to be an opportunity to transform and readapt traditional library services. In this context, search engines play a fundamental role in supporting the new paradigm of knowledge by capturing, storing and providing access to resources, allowing a digital library to exist on every computer with Internet access. In this article we analyze ways of developing a digital library, paying particular attention to the web harvesting technique, and present the capabilities and limitations of digital libraries. We then summarize relevant projects and initiatives, and finally study the role of search engines with respect to digital preservation, access and information diffusion.
C1 - New York, NY, USA C3 - Proceedings of the 2007 Euro American Conference on Telematics and Information Systems DA - 2007/// PY - 2007 DO - 10.1145/1352694.1352703 SP - 8:1 EP - 8:9 PB - ACM SN - 978-1-59593-598-4 UR - http://doi.acm.org/10.1145/1352694.1352703 KW - web archiving KW - digital preservation KW - digital libraries KW - web harvesting KW - search engines KW - information systems ER - TY - JOUR TI - Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages AU - Alkwai, Lulwah M AU - Nelson, Michael L AU - Weigle, Michele C T2 - ACM Trans. Inf. Syst. AB - It has long been suspected that web archives and search engines favor Western and English language webpages. In this article, we quantitatively explore how well indexed and archived Arabic language webpages are as compared to those from other languages. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multilingual), Raddadi, and Star28 (the last two primarily Arabic language). Using language identification tools, we eliminated pages not in the Arabic language (e.g., English-language versions of Aljazeera pages) and culled the collection to 7,976 Arabic language webpages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages. We compared the analysis of Arabic language pages with that of English, Danish, and Korean language pages. First, for each language, we sampled unique URIs from DMOZ; then, using language identification tools, we kept only pages in the desired language. Finally, we crawled the archived and live web to collect a larger sample of pages in English, Danish, or Korean. In total for the four languages, we analyzed over 500,000 webpages. We discovered: (1) English has a higher archiving rate than Arabic, with 72.04% archived. 
However, Arabic has a higher archiving rate than Danish and Korean, with 53.36% of Arabic URIs archived, followed by Danish and Korean with 35.89% and 32.81% archived, respectively. (2) Most Arabic and English language pages are located in the United States; only 14.84% of the Arabic URIs had an Arabic country code top-level domain (e.g., sa) and only 10.53% had a GeoIP in an Arabic country. Most Danish-language pages were located in Denmark, and most Korean-language pages were located in South Korea. (3) The presence of a webpage in a directory positively impacts indexing, and presence in the DMOZ directory specifically positively impacts archiving in all four languages. In this work, we show that web archives and search engines favor English pages. However, this is not universally true for all Western-language webpages because, in this work, we show that Arabic webpages have a higher archival rate than Danish-language webpages. DA - 2017/// PY - 2017 DO - 10.1145/3041656 VL - 36 IS - 1 SP - 1:1 EP - 1:34 SN - 1046-8188 UR - http://doi.acm.org/10.1145/3041656 KW - Web archiving KW - digital preservation KW - Arabic web KW - Danish web KW - English web KW - indexing KW - Korean web ER - TY - CONF TI - Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive AU - Ainsworth, Scott G AU - Nelson, Michael L AB - When a user views an archived page using the archive's user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, drifting away from the datetime originally selected. When browsing sparsely-archived pages, this nearly-silent drift can be many years in just a few clicks.
We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive's Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to less than 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy. C1 - New York, NY, USA C3 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2013/// PY - 2013 DO - 10.1145/2467696.2467718 SP - 39 EP - 48 PB - ACM SN - 978-1-4503-2077-1 UR - http://doi.acm.org/10.1145/2467696.2467718 KW - web archiving KW - digital preservation KW - http KW - web architecture KW - resource versioning KW - temporal applications ER - TY - CONF TI - Access Patterns for Robots and Humans in Web Archives AU - AlNoamany, Yasmin A AU - Weigle, Michele C AU - Nelson, Michael L AB - Although user access patterns on the live web are well-understood, there has been no corresponding study of how users, both humans and robots, access web archives. Based on samples from the Internet Archive's public Wayback Machine, we propose a set of basic usage patterns: Dip (a single access), Slide (the same page at different archive times), Dive (different pages at approximately the same archive time), and Skim (lists of what pages are archived, i.e., TimeMaps). 
Robots are limited almost exclusively to Dips and Skims, but human accesses are more varied between all four types. Robots outnumber humans 10:1 in terms of sessions, 5:4 in terms of raw HTTP accesses, and 4:1 in terms of megabytes transferred. Robots almost always access TimeMaps (95% of accesses), but humans predominately access the archived web pages themselves (82% of accesses). In terms of unique archived web pages, there is no overall preference for a particular time, but the recent past (within the last year) shows significant repeat accesses. C1 - New York, NY, USA C3 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2013/// PY - 2013 DO - 10.1145/2467696.2467722 SP - 339 EP - 348 PB - ACM SN - 978-1-4503-2077-1 UR - http://doi.acm.org/10.1145/2467696.2467722 KW - web archiving KW - user access patterns KW - web robot detection KW - web server logs KW - web usage mining ER - TY - CONF TI - Generating Stories From Archived Collections AU - AlNoamany, Yasmin AU - Weigle, Michele C AU - Nelson, Michael L AB - With the extensive growth of the Web, multiple Web archiving initiatives have been started to archive different aspects of the Web. Services such as Archive-It exist to allow institutions to develop, curate, and preserve collections of Web resources. Understanding the contents and boundaries of these archived collections is a challenge, resulting in the paradox of the larger the collection, the harder it is to understand. Meanwhile, as the sheer volume of data grows on the Web, "storytelling" is becoming a popular technique in social media for selecting Web resources to support a particular narrative or "story". We address the problem of understanding archived collections by proposing the Dark and Stormy Archive (DSA) framework, in which we integrate "storytelling" social media and Web archives. 
In the DSA framework, we identify, evaluate, and select candidate Web pages from archived collections that summarize the holdings of these collections, arrange them in chronological order, and then visualize these pages using tools that users already are familiar with, such as Storify. Inspired by the Turing Test, we evaluate the stories automatically generated by the DSA framework against a ground truth dataset of hand-crafted stories, generated by expert archivists from Archive-It collections. Using Amazon's Mechanical Turk, we found that the stories automatically generated by DSA are indistinguishable from those created by human subject domain experts, while at the same time both kinds of stories (automatic and human) are easily distinguished from randomly generated stories. C1 - New York, NY, USA C3 - Proceedings of the 2017 ACM on Web Science Conference DA - 2017/// PY - 2017 DO - 10.1145/3091478.3091508 SP - 309 EP - 318 PB - ACM SN - 978-1-4503-4896-6 UR - http://doi.acm.org/10.1145/3091478.3091508 KW - web archiving KW - archived collections KW - document similarity KW - information retrieval KW - internet archive KW - storytelling KW - web content mining ER - TY - CONF TI - Web Spam Filtering in Internet Archives AU - Erdélyi, Miklós AU - Benczúr, András A AU - Masanés, Julien AU - Siklósi, Dávid AB - While Web spam is targeted for the high commercial value of top-ranked search-engine results, Web archives observe quality deterioration and resource waste as a side effect. So far Web spam filtering technologies are rarely used by Web archivists but planned in the future as indicated in a survey with responses from more than 20 institutions worldwide. These archives typically operate on a modest level of budget that prohibits the operation of standalone Web spam filtering but collaborative efforts could lead to a high quality solution for them. 
In this paper we illustrate spam filtering needs, opportunities and blockers for Internet archives by analyzing several crawl snapshots, and the difficulty of migrating filter models across different crawls via the example of the 13 .uk snapshots performed by UbiCrawler, which include WEBSPAM-UK2006 and WEBSPAM-UK2007. C1 - New York, NY, USA C3 - Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web DA - 2009/// PY - 2009 DO - 10.1145/1531914.1531918 SP - 17 EP - 20 PB - ACM SN - 978-1-60558-438-6 UR - http://doi.acm.org/10.1145/1531914.1531918 KW - information retrieval KW - web spam KW - document classification KW - time series analysis KW - web archival ER - TY - CONF TI - What is Part of That Resource?: User Expectations for Personal Archiving AU - Poursardar, Faryaneh AU - Shipman, Frank AB - Users wish to preserve Internet resources for later use. But what is part of and what is not part of an Internet resource remains an open question. In this paper we examine how specific relationships between web pages affect user perceptions of their being part of the same resource. This study presented participants with pairs of pages and asked about their expectation for having access to the second page after they save the first. The primary-page content in the study comes from multi-page stories, multi-image collections, product pages with reviews and ratings on separate pages, and short single-page writings. Participants were asked to agree or disagree with three statements regarding their expectation for later access. Nearly 80% of participants agreed in the case of articles spread across multiple pages, images in the same collection, and additional details or assessments of product information. About 50% agreed for related content on pages linked to by the original page or related items, while only about 30% thought advertisements or wish lists linked to were part of the resource.
Differences in responses to the same page pairs for the three statements regarding later access indicate some users distinguish between what would be valuable to them and their expectations of systems saving or archiving web content. C1 - Piscataway, NJ, USA C3 - Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries DA - 2017/// PY - 2017 SP - 229 EP - 238 PB - IEEE Press SN - 978-1-5386-3861-3 UR - http://dl.acm.org/citation.cfm?id=3200334.3200359 KW - web archiving KW - digital preservation KW - personal archiving ER - TY - CONF TI - Journey to the past AU - Jatowt, Adam AU - Kawai, Yukiko AU - Nakamura, Satoshi AU - Kidawara, Yutaka AU - Tanaka, Katsumi AB - While the Internet community recognized early on the need to store and preserve past content of the Web for future use, the tools developed so far for retrieving information from Web archives are still difficult to use and far less efficient than those developed for the "live Web." We expect that future information retrieval systems will utilize both the "live" and "past Web" and have thus developed a general framework for a past Web browser. A browser built using this framework would be a client-side system that downloads, in real time, past page versions from Web archives for their customized presentation. It would use passive browsing, change detection and change animation to provide a smooth and satisfactory browsing experience. We propose a meta-archive approach for increasing the coverage of past Web pages and for providing a unified interface to the past Web. Finally, we introduce query-based and localized approaches for filtered browsing that enhance and speed up browsing and information retrieval from Web archives.
C1 - New York, New York, USA C3 - Proceedings of the seventeenth conference on Hypertext and hypermedia - HYPERTEXT '06 DA - 2006/// PY - 2006 DO - 10.1145/1149941.1149969 SP - 135 PB - ACM Press SN - 1-59593-417-0 UR - http://doi.acm.org/10.1145/1149941.1149969 L4 - http://portal.acm.org/citation.cfm?doid=1149941.1149969 KW - web archive KW - past web KW - past web browser ER - TY - CONF TI - Histrace: Building a Search Engine of Historical Events AU - Huang, Lian'en AU - Zhu, Jonathan J H AU - Li, Xiaoming AB - In this paper, we describe an experimental search engine on our Chinese web archive since 2001. The original data set contains nearly 3 billion Chinese web pages crawled from past 5 years. From the collection, 430 million "article-like" pages are selected and then partitioned into 68 million sets of similar pages. The titles and publication dates are determined for the pages. An index is built. When searching, the system returns related pages in a chronological order. This way, if a user is interested in news reports or commentaries for certain previously happened event, he/she will be able to find a quite rich set of highly related pages in a convenient way. C1 - New York, NY, USA C3 - Proceedings of the 17th International Conference on World Wide Web DA - 2008/// PY - 2008 DO - 10.1145/1367497.1367703 SP - 1155 EP - 1156 PB - ACM SN - 978-1-60558-085-2 UR - http://doi.acm.org/10.1145/1367497.1367703 KW - web archive KW - text mining KW - replica detection ER - TY - CONF TI - Factors Affecting Website Reconstruction from the Web Infrastructure AU - McCown, Frank AU - Diawara, Norou AU - Nelson, Michael L AB - When a website is suddenly lost without a backup, it maybe reconstituted by probing web archives and search engine caches for missing content. In this paper we describe an experiment where we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. 
The reconstructions were performed using our web-repository crawler named Warrick which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time including birth rate, decay and age of resources. We evaluate the reconstructions when compared to the crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website's resources. We found that Google's PageRank, number of hops and resource age were the three most significant factors in determining if a resource would be recovered from the WI. C1 - New York, NY, USA C3 - Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2007/// PY - 2007 DO - 10.1145/1255175.1255182 SP - 39 EP - 48 PB - ACM SN - 978-1-59593-644-8 UR - http://doi.acm.org/10.1145/1255175.1255182 KW - web archiving KW - digital preservation KW - search engine caches ER - TY - CONF TI - How Well Are Arabic Websites Archived? AU - Alkwai, Lulwah M AU - Nelson, Michael L AU - Weigle, Michele C AB - It has long been anecdotally known that web archives and search engines favor Western and English-language sites. In this paper we quantitatively explore how well indexed and archived Arabic language web sites are. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multi-lingual), Raddadi and Star28 (both primarily Arabic language). Using language identification tools we eliminated pages not in the Arabic language (e.g., English language versions of Al-Jazeera sites) and culled the collection to 7,976 definitely Arabic language web pages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages.
We discovered: 1) 46% are not archived and 31% are not indexed by Google (www.google.com), 2) only 14.84% of the URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country, 3) having either only an Arabic GeoIP or only an Arabic top-level domain appears to negatively impact archiving, 4) most of the archived pages are near the top level of the site and deeper links into the site are not well-archived, 5) the presence in a directory positively impacts indexing, and presence in the DMOZ directory specifically positively impacts archiving. C1 - New York, NY, USA C3 - Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2015/// PY - 2015 DO - 10.1145/2756406.2756912 SP - 223 EP - 232 PB - ACM SN - 978-1-4503-3594-2 UR - http://doi.acm.org/10.1145/2756406.2756912 KW - web archiving KW - digital preservation KW - Arabic web KW - indexing KW - Design KW - Experimentation KW - Measurement ER - TY - CONF TI - Data Quality in Web Archiving AU - Spaniol, Marc AU - Denev, Dimitar AU - Mazeika, Arturas AU - Weikum, Gerhard AU - Senellart, Pierre AB - Web archives preserve the history of Web sites and have high long-term value for media and business analysts. Such archives are maintained by periodically re-crawling entire Web sites of interest. From an archivist's point of view, the ideal case to ensure highest possible data quality of the archive would be to "freeze" the complete contents of an entire Web site during the time span of crawling and capturing the site. Of course, this is practically infeasible. To comply with the politeness specification of a Web site, the crawler needs to pause between subsequent http requests in order to avoid unduly high load on the site's http server. As a consequence, capturing a large Web site may span hours or even days, which increases the risk that contents collected so far are incoherent with the parts that are still to be crawled.
This paper introduces a model for identifying coherent sections of an archive and, thus, measuring the data quality in Web archiving. Additionally, we present a crawling strategy that aims to ensure archive coherence by minimizing the diffusion of Web site captures. Preliminary experiments demonstrate the usefulness of the model and the effectiveness of the strategy. C1 - New York, NY, USA C3 - Proceedings of the 3rd Workshop on Information Credibility on the Web DA - 2009/// PY - 2009 DO - 10.1145/1526993.1526999 SP - 19 EP - 26 PB - ACM SN - 978-1-60558-488-1 UR - http://doi.acm.org/10.1145/1526993.1526999 KW - web archiving KW - data quality KW - temporal coherence ER - TY - CONF TI - Focused Crawl of Web Archives to Build Event Collections AU - Klein, Martin AU - Balakireva, Lyudmila AU - de Sompel, Herbert AB - Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique where the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace with which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. By utilizing the Memento infrastructure, we obtain resources from 22 web archives that contribute to building event collections. We create collections on four events and compare the relevance of their resources to collections built from crawling the live web as well as from a manually curated collection. 
Our results show that focused crawling on the archived web can be done and indeed results in highly relevant collections, especially for events that happened further in the past. C1 - New York, NY, USA C3 - Proceedings of the 10th ACM Conference on Web Science DA - 2018/// PY - 2018 DO - 10.1145/3201064.3201085 SP - 333 EP - 342 PB - ACM SN - 978-1-4503-5563-6 UR - http://doi.acm.org/10.1145/3201064.3201085 KW - web archiving KW - collection building KW - focused crawling KW - memento ER - TY - CONF TI - Multiple Media Analysis and Visualization for Understanding Social Activities AU - Toyoda, Masashi AB - The Web has come to include diverse media services, such as blogs, photo/video/link sharing, social networks, and microblogs. These Web media react to and affect real-world events, while the mass media still has a big influence on social activities. The Web and mass media now affect each other. Our use of media has evolved dynamically in the last decade, and this affects our societal behavior. For instance, the first photo of a plane crash landing during the "Miracle on the Hudson" on January 15, 2009 appeared and spread on Twitter and was then used in TV news. During the "Chelyabinsk Meteor" incident on February 15, 2013, many people posted videos of the incident on YouTube, and mass media then reused them in TV programs. Large-scale collection, analysis, and visualization of those multiple media are strongly required for sociology, linguistics, risk management, and marketing research. We are building a huge-scale Japanese web archive, and various analytics engines with a large-scale display wall. Our archive consists of 30 billion web pages crawled over 14 years, 1 billion blog posts over 7 years, and 15 billion tweets over 3 years. In this talk, I present several analysis and visualization systems based on network analysis, natural language processing, image processing, and 3-dimensional visualization.
C1 - New York, NY, USA C3 - Proceedings of the 23rd International Conference on World Wide Web DA - 2014/// PY - 2014 DO - 10.1145/2567948.2579040 SP - 825 EP - 826 PB - ACM SN - 978-1-4503-2745-9 UR - http://doi.acm.org/10.1145/2567948.2579040 KW - web archive KW - multiple media analysis KW - visualization ER - TY - CONF TI - Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly AU - Brunelle, Justin F AU - Weigle, Michele C AU - Nelson, Michael L AB - The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, resulting in pages that are correspondingly difficult to archive. JavaScript enables interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to discover and crawl all of the resources in deferred representations and the result of archiving deferred representations is archived web pages that are either incomplete or erroneously load embedded resources from the live web. We propose a method of discovering and archiving deferred representations and their descendants (representation states) that are only reachable through client-side events. Our approach identified an average of 38.5 descendants per seed URI crawled, 70.9% of which are reached through an onclick event. This approach also added 15.6 times more embedded resources than Heritrix to the crawl frontier, but at a crawl rate that was 38.9 times slower than simply using Heritrix. If our method was applied to the July 2015 Common Crawl dataset, a web-scale archival crawler will discover an additional 7.17 PB (5.12 times more) of information per year. This illustrates the significant increase in resources necessary for more thorough archival crawls. 
C1 - Piscataway, NJ, USA C3 - Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries DA - 2017/// PY - 2017 SP - 1 EP - 10 PB - IEEE Press SN - 978-1-5386-3861-3 UR - http://dl.acm.org/citation.cfm?id=3200334.3200336 KW - web archiving KW - digital preservation KW - memento KW - web crawling ER - TY - CONF TI - Using Transactional Web Archives To Handle Server Errors AU - Xie, Zhiwu AU - Chandrasekar, Prashant AU - Fox, Edward A. AB - We describe a web archiving application that handles server errors using the most recently archived representation of the requested web resource. The application is developed as an Apache module. It leverages the transactional web archiving tool SiteStory, which archives all previously accessed representations of web resources originating from a website. This application helps to improve the website's quality of service by temporarily masking server errors from the end user and gaining precious time for the system administrator to debug and recover from server failures. By providing pertinent support to website operations, we aim to reduce the resistance to transactional web archiving, which in turn may lead to a better coverage of web history. C1 - New York, New York, USA C3 - Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '15 DA - 2015/// PY - 2015 DO - 10.1145/2756406.2756955 SP - 241 EP - 242 PB - ACM Press SN - 978-1-4503-3594-2 UR - http://dl.acm.org/citation.cfm?doid=2756406.2756955 KW - Memento KW - Digital preservation KW - SiteStory KW - transactional web archiving ER - TY - CONF TI - The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript AU - Kelly, Mat AU - Nelson, Michael L AU - Weigle, Michele C AB - When preserving web pages, archival crawlers sometimes produce a result that varies from what an end-user expects.
To quantitatively evaluate the degree to which an archival crawler is capable of comprehensively reproducing a web page from the live web into the archives, the crawlers' capabilities must be evaluated. In this paper, we propose a set of metrics to evaluate the capability of archival crawlers and other preservation tools using the Acid Test concept. For a variety of web preservation tools, we examine previous captures within web archives and note the features that produce incomplete or unexpected results. From there, we design the test to produce a quantitative measure of how well each tool performs its task. C1 - Piscataway, NJ, USA C3 - Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2014/// PY - 2014 SP - 25 EP - 28 PB - IEEE Press SN - 978-1-4799-5569-5 UR - http://dl.acm.org/citation.cfm?id=2740769.2740774 KW - web archiving KW - digital preservation KW - web crawler KW - Experimentation KW - Standardization KW - Verification ER - TY - CONF TI - Recovering a Website's Server Components from the Web Infrastructure AU - McCown, Frank AU - Nelson, Michael L AB - Our previous research has shown that the collective behavior of search engine caches (e.g., Google, Yahoo, Live Search) and web archives (e.g., Internet Archive) results in the uncoordinated but large-scale refreshing and migrating of web resources. Interacting with these caches and archives, which we call the Web Infrastructure (WI), allows entire websites to be reconstructed in an approach we call lazy preservation. Unfortunately, the WI only captures the client-side view of a web resource. While this may be useful for recovering much of the content of a website, it is not helpful for restoring the scripts, web server configuration, databases, and other server-side components responsible for the construction of the website's resources. This paper proposes a novel technique for storing and recovering the server-side components of a website from the WI. 
Using erasure codes to embed the server-side components as HTML comments throughout the website, we can effectively reconstruct all the server components of a website when only a portion of the client-side resources have been extracted from the WI. We present the results of a preliminary study that baselines the lazy preservation of ten EPrints repositories and then examines the preservation of an EPrints repository that uses the erasure code technique to store the server-side EPrints software throughout the website. We found nearly 100% of the EPrints components were recoverable from the WI just two weeks after the repository came online, and it remained recoverable four months after it was "lost". C1 - New York, NY, USA C3 - Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2008/// PY - 2008 DO - 10.1145/1378889.1378911 SP - 124 EP - 133 PB - ACM SN - 978-1-59593-998-2 UR - http://doi.acm.org/10.1145/1378889.1378911 KW - web archiving KW - digital preservation KW - backup KW - search engine caches KW - web server ER - TY - CONF TI - Using Visual Pages Analysis for Optimizing Web Archiving AU - Saad, Myriam Ben AU - Gançarski, Stéphane C1 - New York, NY, USA C3 - Proceedings of the 2010 EDBT/ICDT Workshops DA - 2010/// PY - 2010 DO - 10.1145/1754239.1754287 SP - 43:1 EP - 43:7 PB - ACM SN - 978-1-60558-990-9 UR - http://doi.acm.org/10.1145/1754239.1754287 KW - web archiving KW - web crawling KW - change detection KW - visual page analysis ER - TY - CONF TI - Demonstrating intelligent crawling and archiving of web applications AU - Faheem, Muhammad AU - Senellart, Pierre AB - We demonstrate here a new approach to Web archival crawling, based on an application-aware helper that drives crawls of Web applications according to their types (especially, according to their content management systems). 
By adapting the crawling strategy to the Web application type, one is able to crawl a given Web application (say, a given forum or blog) with fewer requests than traditional crawling techniques. Additionally, the application-aware helper is able to extract semantic content from the Web pages crawled, which results in a Web archive of richer value to an archive user. In our demonstration scenario, we invite a user to compare application-aware crawling to regular Web crawling on the Web site of their choice, both in terms of efficiency and of experience in browsing and searching the archive. C1 - New York, NY, USA C3 - Proceedings of the 22nd ACM international conference on Conference on information & knowledge management DA - 2013/// PY - 2013 DO - 10.1145/2505515.2508197 SP - 2481 EP - 2484 PB - ACM SN - 978-1-4503-2263-8 UR - http://doi.acm.org/10.1145/2505515.2508197 KW - web archiving KW - crawling KW - content management system KW - web application ER - TY - CONF TI - Web Spam Challenge Proposal for Filtering in Archives AU - Benczúr, András A AU - Erdélyi, Miklós AU - Masanés, Julien AU - Siklósi, Dávid AB - In this paper we propose new tasks for a possible future Web Spam Challenge motivated by the needs of the archival community. The Web archival community consists of several relatively small institutions that operate independently and possibly over different top level domains (TLDs). Each of them may have a large set of historic crawls. Efficient filtering would hence require (1) enhanced use of the time series of domain snapshots and (2) collaboration by transferring models across different TLDs. Corresponding Challenge tasks could hence include the distribution of crawl snapshot data for feature generation as well as classification of unlabeled new crawls of the same or even different TLDs. 
C1 - New York, NY, USA C3 - Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web DA - 2009/// PY - 2009 DO - 10.1145/1531914.1531928 SP - 61 EP - 62 PB - ACM SN - 978-1-60558-438-6 UR - http://doi.acm.org/10.1145/1531914.1531928 KW - information retrieval KW - web spam KW - evaluation KW - document classification KW - web archival KW - challenge ER - TY - CONF TI - Local Memory Project: Providing Tools to Build Collections of Stories for Local Events from Local Sources AU - Nwala, Alexander C AU - Weigle, Michele C AU - Nelson, Michael L AU - Ziegler, Adam B AU - Aizman, Anastasia AB - The national (non-local) news media has different priorities than the local news media. If one seeks to build a collection of stories about local events, the national news media may be insufficient, with the exception of local news which "bubbles" up to the national news media. If we rely exclusively on national media, or build collections exclusively on their reports, we could be late to the important milestones which precipitate major local events, thus, run the risk of losing important stories due to link rot and content drift. Consequently, it is important to consult local sources affected by local events. Our goal is to provide a suite of tools (beginning with two) under the umbrella of the Local Memory Project (LMP) to help users and small communities discover, collect, build, archive, and share collections of stories for important local events by leveraging local news sources. The first service (Geo) returns a list of local news sources (newspaper, TV and radio stations) in order of proximity to a user-supplied zip code. The second service (Local Stories Collection Generator) discovers, collects and archives a collection of news stories about a story or event represented by a user-supplied query and zip code pair. 
We evaluated 20 pairs of collections, Local (generated by our system) and non-Local, by measuring archival coverage, tweet index rate, temporal range, precision, and sub-collection overlap. Our experimental results showed Local and non-Local collections with archive rates of 0.63 and 0.83, respectively, and tweet index rates of 0.59 and 0.80, respectively. Local collections produced older stories than non-Local collections, at a higher precision (relevance) of 0.84 compared to a non-Local precision of 0.72. These results indicate that Local collections are less exposed, thus less popular than their non-Local counterpart. C1 - Piscataway, NJ, USA C3 - Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries DA - 2017/// PY - 2017 SP - 219 EP - 228 PB - IEEE Press SN - 978-1-5386-3861-3 UR - http://dl.acm.org/citation.cfm?id=3200334.3200358 KW - web archiving KW - collections building KW - digital collections KW - journalism KW - local news KW - news ER - TY - CONF TI - Index Maintenance for Time-travel Text Search AU - Anand, Avishek AU - Bedathur, Srikanta AU - Berberich, Klaus AU - Schenkel, Ralf AB - Time-travel text search enriches standard text search by temporal predicates, so that users of web archives can easily retrieve document versions that are considered relevant to a given keyword query and existed during a given time interval. Different index structures have been proposed to efficiently support time-travel text search. None of them, however, can easily be updated as the Web evolves and new document versions are added to the web archive. In this work, we describe a novel index structure that efficiently supports time-travel text search and can be maintained incrementally as new document versions are added to the web archive. Our solution uses a sharded index organization, bounds the number of spuriously read index entries per shard, and can be maintained using small in-memory buffers and append-only operations.
We present experiments on two large-scale real-world datasets demonstrating that maintaining our novel index structure is an order of magnitude more efficient than periodically rebuilding one of the existing index structures, while query-processing performance is not adversely affected. C1 - New York, NY, USA C3 - Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval DA - 2012/// PY - 2012 DO - 10.1145/2348283.2348318 SP - 235 EP - 244 PB - ACM SN - 978-1-4503-1472-5 UR - http://doi.acm.org/10.1145/2348283.2348318 KW - web archives KW - index maintenance KW - time-travel text search ER - TY - CONF TI - iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling AU - Gossen, Gerhard AU - Demidova, Elena AU - Risse, Thomas AB - Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Especially Social Media provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issues of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler.
C1 - New York, NY, USA C3 - Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2015/// PY - 2015 DO - 10.1145/2756406.2756925 SP - 75 EP - 84 PB - ACM SN - 978-1-4503-3594-2 UR - http://doi.acm.org/10.1145/2756406.2756925 KW - web archives KW - social media KW - focused crawling KW - web crawling ER - TY - CONF TI - Observing Web Archives: The Case for an Ethnographic Study of Web Archiving AU - Ogden, Jessica AU - Halford, Susan AU - Carr, Leslie AB - This paper makes the case for studying the work of web archivists, in an effort to explore the ways in which practitioners shape the preservation and maintenance of the archived Web in its various forms. An ethnographic approach is taken through the use of observation, interviews and documentary sources over the course of several weeks in collaboration with web archivists, engineers and managers at the Internet Archive - a private, non-profit digital library that has been archiving the Web since 1996. The concept of web archival labour is proposed to encompass and highlight the ways in which web archivists (as both networked human and non-human agents) shape and maintain the preserved Web through work that is often embedded in and obscured by the complex technical arrangements of collection and access. 
As a result, this engagement positions web archives as places of knowledge and cultural production in their own right, revealing new insights into the performative nature of web archiving that have implications for how these data are used and understood. C1 - New York, NY, USA C3 - Proceedings of the 2017 ACM on Web Science Conference DA - 2017/// PY - 2017 DO - 10.1145/3091478.3091506 SP - 299 EP - 308 PB - ACM SN - 978-1-4503-4896-6 UR - http://doi.acm.org/10.1145/3091478.3091506 KW - web archiving KW - information labour KW - knowledge production KW - materiality KW - sts ER - TY - CONF TI - Usage Analysis of a Public Website Reconstruction Tool AU - McCown, Frank AU - Nelson, Michael L AB - The Web is increasingly the medium by which information is published today, but due to its ephemeral nature, web pages and sometimes entire websites are often "lost" due to server crashes, viruses, hackers, run-ins with the law, bankruptcy and loss of interest. When a website is lost and backups are unavailable, an individual or third party can use Warrick to recover the website from several search engine caches and web archives (the Web Infrastructure). In this short paper, we present Warrick usage data obtained from Brass, a queueing system for Warrick hosted at Old Dominion University and made available to the public for free. Over the last six months, 520 individuals have reconstructed more than 700 websites with 800K resources from the Web Infrastructure. Sixty-two percent of the static web pages were recovered, and 41% of all website resources were recovered. The Internet Archive was the largest contributor of recovered resources (78%).
C1 - New York, NY, USA C3 - Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2008/// PY - 2008 DO - 10.1145/1378889.1378955 SP - 371 EP - 374 PB - ACM SN - 978-1-59593-998-2 UR - http://doi.acm.org/10.1145/1378889.1378955 KW - web archiving KW - digital preservation KW - search engine caches ER - TY - CONF TI - Sub-document Timestamping of Web Documents AU - Zhao, Yue AU - Hauff, Claudia AB - Knowledge about a (Web) document's creation time has been shown to be an important factor in various temporal information retrieval settings. Commonly, it is assumed that such documents were created at a single point in time. While this assumption may hold for news articles and similar document types, it is a clear oversimplification for general Web documents. In this paper, we investigate to what extent (i) this simplifying assumption is violated for a corpus of Web documents, and, (ii) it is possible to accurately estimate the creation time of individual Web documents' components (so-called sub-documents). C1 - New York, New York, USA C3 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15 DA - 2015/// PY - 2015 DO - 10.1145/2766462.2767803 SP - 1023 EP - 1026 PB - ACM Press SN - 978-1-4503-3621-5 UR - http://dl.acm.org/citation.cfm?doid=2766462.2767803 KW - Web archiving KW - web archiving KW - sub-documents KW - timestamping ER - TY - CONF TI - Creating a billion-scale searchable web archive AU - Gomes, Daniel AU - Costa, Miguel AU - Cruz, David AU - Miranda, João AU - Fontes, Simão AB - Web information is ephemeral. Several organizations around the world are struggling to archive information from the web before it vanishes. However, users demand efficient and effective search mechanisms to access the already vast collections of historical information held by web archives. The Portuguese Web Archive is the largest full-text searchable web archive publicly available. 
It supports search over 1.2 billion files archived from the web since 1996. This study contributes an overview of the lessons learned while developing the Portuguese Web Archive, focusing on web data acquisition, ranking search results and user interface design. The developed software is freely available as an open source project. We believe that sharing our experience obtained while developing and operating a running service will enable other organizations to start or improve their web archives. C1 - New York, New York, USA C3 - Proceedings of the 22nd International Conference on World Wide Web - WWW '13 Companion DA - 2013/// PY - 2013 DO - 10.1145/2487788.2488118 SP - 1059 EP - 1066 PB - ACM Press SN - 978-1-4503-2038-2 UR - http://dl.acm.org/citation.cfm?doid=2487788.2488118 KW - Web KW - Preservation KW - Archive KW - Portuguese Web Archive KW - Temporal Search KW - Search ER - TY - CONF TI - Archival HTTP redirection retrieval policies AU - AlSum, Ahmed AU - Nelson, Michael L. AU - Sanderson, Robert AU - Van de Sompel, Herbert AB - When retrieving archived copies of web resources (mementos) from web archives, the original resource's URI-R is typically used as the lookup key in the web archive. This is straightforward until the resource on the live web issues a redirect: R -> R'. Then it is not clear if R or R' should be used as the lookup key to the web archive. In this paper, we report on a quantitative study to evaluate a set of policies to help the client discover the correct memento when faced with redirection. We studied the stability of 10,000 resources and found that 48% of the sample URIs tested were not stable, with respect to their status and redirection location. 27% of the resources were not perfectly reliable in terms of the number of mementos of successful responses over the total number of mementos, and 2% had a reliability score of less than 0.5. We tested two retrieval policies.
The first policy covered the resources which currently issue redirects and successfully resolved 17 out of 77 URIs that did not have mementos of the original URI, but did have mementos of the resource that was being redirected to. The second policy covered archived copies with HTTP redirection and helped the client in 58% of the cases tested to discover the nearest memento to the requested datetime. C1 - New York, New York, USA C3 - Proceedings of the 22nd International Conference on World Wide Web - WWW '13 Companion DA - 2013/// PY - 2013 DO - 10.1145/2487788.2488117 SP - 1051 EP - 1058 PB - ACM Press SN - 978-1-4503-2038-2 UR - http://dl.acm.org/citation.cfm?doid=2487788.2488117 KW - Design KW - Experimentation KW - Standardization ER - TY - CONF TI - ArchiveNow AU - Aturban, Mohamed AU - Kelly, Mat AU - Alam, Sawood AU - Berlin, John A. AU - Nelson, Michael L. AU - Weigle, Michele C. AB - ArchiveNow is a Python module for preserving web pages in on-demand web archives. This module allows a user to submit a URI of a web page for archiving at several configured web archives. Once the web page is captured, ArchiveNow provides the user with links to the archived copies of the web page. ArchiveNow is initially configured to use four archives but is easily configurable to add or remove other archives. In addition to pushing web pages to public archives, ArchiveNow, through the use of Wget and Squidwarc, allows users to generate local WARC files, enabling them to create their own personal and private archives.
C1 - New York, New York, USA C3 - Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries - JCDL '18 DA - 2018/// PY - 2018 DO - 10.1145/3197026.3203880 SP - 321 EP - 322 PB - ACM Press SN - 978-1-4503-5178-2 UR - http://dl.acm.org/citation.cfm?doid=3197026.3203880 KW - Memento KW - WARC KW - Web Archiving ER - TY - JOUR TI - SHARC AU - Denev, Dimitar AU - Mazeika, Arturas AU - Spaniol, Marc AU - Weikum, Gerhard T2 - Proceedings of the VLDB Endowment AB - Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather sharp captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies towards better quality with given resources. We define quality measures, characterize their properties, and derive a suite of quality-conscious scheduling strategies for archive crawling. It is assumed that change rates of Web pages can be statistically predicted based on page types, directory depths, and URL names. We develop a stochastically optimal crawl algorithm for the offline case where all change rates are known. We generalize the approach into an online algorithm that detects information on a Web site while it is crawled. For dating a site capture and for assessing its quality, we propose several strategies that revisit pages after their initial downloads in a judiciously chosen order. All strategies are fully implemented in a testbed, and shown to be effective by experiments with both synthetically generated sites and a daily crawl series for a medium-sized site.
DA - 2009/08/01/ PY - 2009 DO - 10.14778/1687627.1687694 VL - 2 IS - 1 SP - 586 EP - 597 SN - 2150-8097 UR - http://dl.acm.org/citation.cfm?doid=1687627.1687694 KW - Web Archiving KW - Data Quality ER - TY - CONF TI - Building a story tracer out of a web archive AU - Huang, Lian'en AU - Zhu, Jonathan J. H. AU - Li, Xiaoming AB - There are quite a few web archives around the world, such as Internet Archive and Web InfoMall (http://www.infomall.cn). Nevertheless, we have not seen substantial mechanisms built on top of the archives to render the value of the data beyond what the Wayback Machine offers. One of the reasons for this situation is the lack of a system vision and design which encompasses the oceanic data in a meaningful and cost-effective way. This paper describes an effort in this direction. C1 - New York, New York, USA C3 - Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL '08 DA - 2008/// PY - 2008 DO - 10.1145/1378889.1379000 SP - 455 PB - ACM Press SN - 978-1-59593-998-2 UR - http://portal.acm.org/citation.cfm?doid=1378889.1379000 KW - Web archive KW - Text mining ER - TY - CONF TI - Learning temporal-dependent ranking models AU - Costa, Miguel AU - Couto, Francisco AU - Silva, Mário AB - Web archives already hold together more than 534 billion files and this number continues to grow as new initiatives arise. Searching on all versions of these files acquired throughout time is challenging, since users expect as fast and precise answers from web archives as the ones provided by current web search engines. This work studies, for the first time, how to improve the search effectiveness of web archives, including the creation of novel temporal features that explore the correlation found between web document persistence and relevance. The persistence was analyzed over 14 years of web snapshots.
Additionally, we propose a temporal-dependent ranking framework that exploits the variance of web characteristics over time that influences ranking models. Based on the assumption that closer periods are more likely to hold similar web characteristics, our framework learns multiple models simultaneously, each tuned for a specific period. Experimental results show significant improvements over the search effectiveness of single models that learn from all data independently of time. Thus, our approach represents an important step forward over the state-of-the-art IR technology usually employed in web archives. C1 - New York, New York, USA C3 - Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval - SIGIR '14 DA - 2014/// PY - 2014 DO - 10.1145/2600428.2609619 SP - 757 EP - 766 PB - ACM Press SN - 978-1-4503-2257-7 UR - http://dl.acm.org/citation.cfm?doid=2600428.2609619 KW - web archives KW - temporal-dependent ranking ER - TY - CONF TI - Persistent annotations deserve new URIs AU - Alasaadi, Abdulla AU - Nelson, Michael L. AB - Some digital libraries support annotations, but sharing these annotations with other systems or across the web is difficult because of the need for special applications to read and decode these annotations. Due to the frequent change of web resources, the annotation's meaning can change if the underlying resources change. This project concentrates on minting a new URI for every annotation and creating a persistent and independent archived version of all resources. Users should be able to select a segment of an image or a video to be part of the annotation. The media fragment URIs described in the Open Annotation Collaboration data model can be used, but in practice they have limits and lack support from browsers. So in this project, segments of images and videos can be used in the annotations without using media fragment URIs.
C1 - New York, New York, USA C3 - Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries - JCDL '11 DA - 2011/// PY - 2011 DO - 10.1145/1998076.1998113 SP - 195 PB - ACM Press SN - 978-1-4503-0744-4 UR - http://portal.acm.org/citation.cfm?doid=1998076.1998113 KW - Web Archiving KW - URI KW - Reliability KW - Design KW - Annotation KW - Persistence ER - TY - JOUR TI - Memory Hole or Right to Delist? Implications of the Right to be Forgotten for Web Archiving AU - Dulong de Rosnay, Melanie AU - Guadamuz, Andrés AB - This article studies the possible impact of the “right to be forgotten” (RTBF) on the preservation of native digital heritage. It analyses the extent to which archival practices may be affected by the new right, and whether the web may become impossible to preserve for future generations, risking disappearance from memory and history since no version would be available in public or private archives. Collective rights to remember and to memory, free access to information and freedom of expression, seem to clash with private individuals’ right to privacy. After a presentation of core legal concepts of privacy, data protection and freedom of expression, we analyse the case of the European Union Court of Justice vs. Google concerning the right to be forgotten, and look deeper into the controversies generated by the decision. We conclude that there is no room for concern for archives and for the right to remember given the restricted application of RTBF. DA - 2017/// PY - 2017 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://halshs.archives-ouvertes.fr/halshs-01399314 KW - Google KW - Wikipedia KW - web archives KW - [ SHS.DROIT ] Humanities and Social Sciences/Law KW - [ SHS.INFO ] Humanities and Social Sciences/Librar KW - [ SHS.SCIPO ] Humanities and Social Sciences/Polit KW - data protection KW - digital archives KW - memory KW - privacy KW - right to be forgotten KW - right to remember ER - TY - JOUR TI - Investigation of the Currency, Disappearance and Half-Life of URLs of Web Resources Cited In Iranian Researchers: A Comparative Study.
AU - Tajedini, Oranus AU - Sadatmoosavi, Ali AU - Ghazizade, Azita AU - Tajedini, Atefe T2 - International Journal of Information Science & Management AB - This research was intended to comparatively investigate the currency, disappearance and half-life of URLs of web resources cited in Iranian researchers' articles indexed in ISI in information science, psychology and management from 2009 to 2011. The research method was citation analysis. The statistical population of this research was all articles by Iranian researchers in psychology, information science and management from 2009 to 2011 which were indexed in SSCI. In order to extract bibliographic information of articles, the ISI database was searched and the titles of the articles were extracted. After investigating the currency and disappearance of cited URLs and calculating the half-life of web resources, collected data were analyzed in accordance with research questions by means of Excel software. The results of this research revealed that in articles written by Iranian researchers indexed in ISI in information science, psychology and management there were 6152, 3639 and 8926 citations, respectively, of which 13.7, 44.8 and 14.23 percent were online citations, respectively. The most frequently used domain in all three fields was .org. The most stable and persistent domain in psychology was .com, in information science it was .org, and in management it was domains other than those mentioned. The most frequent file format was pdf in all three fields. In information science, pdf files were the most stable, while in management rtf files and in psychology ppt files were the most stable, respectively. In the initial search for online citations in psychology, information science and management, respectively, 58, 82 and 88 percent of citations were accessible, which increased after a second check to 95, 98 and 97 percent, respectively.
The research results also demonstrated that most accessible internet addresses in the investigated articles of all three fields were found at the cited internet address. The status of inaccessible internet addresses in all investigated articles regarding error messages also indicated that in psychology and management the 404 error message (Not Found) was the most frequent error, at 34 and 22 percent, respectively, and in information science the 403 error message (Forbidden) was the most frequent error message, at 21 percent. The average half-life of online citations calculated in all investigated articles was 2.6 years, calculated as 3 years and 4 months in information science, 2 years and 5 months in management and 1 year and 9 months in psychology. The results of this research showed that the decay of internet addresses should be regarded as a problem whose most important cause is website reorganization and changes made to the names of internet domains. Some fields are more exposed to and affected by the consequences of the decay of internet addresses. The influence of inactive links on the journals of a field differs based on the authors' reliance on internet-based information. The absolute number of internet addresses also exacerbates the problem of decayed internet addresses for the readers of the articles, as compared with journals whose authors have cited only a few online citations. The consequences of inactive links for articles and resources which can be accessed in other ways, or whose print version is accessible, are less serious. Tools like internet archives might make it possible to have a snapshot of the content of a site at a particular time. Google doesn't index dynamic pages or pages and sites which use robots.txt coding to prevent crawling. The best solution to improve the accessibility of internet resources is to request that all internet information be analyzed and recorded while manuscripts are being examined.
In so doing, the responsibility to archive information will be assigned to the publisher. [ABSTRACT FROM AUTHOR] DA - 2018/01// PY - 2018 VL - 16 IS - 1 SP - 27 EP - 47 SN - 2008-8302 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Internet Archive KW - Information science KW - Web archives KW - Citation analysis KW - Citation of electronic information resources KW - Half-life of Web References KW - Uniform Resource Locators KW - URL Persistence KW - Web Citation Availability ER - TY - JOUR TI - Digital Heritage and Heritagization AU - Musiani, Francesca AU - Schafer, Valerie AB - Introduction to a special issue. The six articles and the introduction composing this issue fully situate themselves within the interdisciplinary dimension of digital heritage analyses, including perspectives from history, information and communication sciences, sociology of innovation, digital humanities or juridical sciences. DA - 2017/// PY - 2017 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://orbilu.uni.lu/handle/10993/35223 L4 - http://hdl.handle.net/10993/35223 KW - digital KW - web archives KW - Arts & humanities :: Multidisciplinary KW - born-digital heritage KW - digital traces KW - general & others [A99] KW - history ER - TY - JOUR TI - Offene Archive: Archive, Nutzer und Technologie im Miteinander AU - Gillner, Bastian T2 - OPEN ARCHIVES: ARCHIVES, USERS AND TECHNOLOGY INTERCONNECTED. AB - The use of archives in the digital age is still a mostly analogue activity.
This is due not only to the fact that digitizing materials is costly and time-consuming, but also to a widespread lack of interest in using the possibilities the internet offers for archives' own agenda. For two decades the internet has primarily been a place for archives to present fixed (meta)data about archival materials. The concept of open archives strives to adapt the use of archives to the realities of the digital age. Its goal is to facilitate open data, with a focus on users and on the use of digital tools. Only the interaction of these aspects can show archives a way to make their cultural heritage available to a large audience in a digital environment and to put it to use in a variety of ways. [ABSTRACT FROM AUTHOR] DA - 2018/01// PY - 2018 VL - 71 IS - 1 SP - 13 EP - 21 SN - 00039500 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archives KW - Archives KW - Digital libraries KW - Digitization of archival materials KW - Open data movement KW - Preservation of cultural property ER - TY - JOUR TI - Glitch AU - Frederick, Ursula K. T2 - Journal of Contemporary Archaeology AB - The rapid and continual advancement of the internet as a platform for communication on archaeological topics has brought permanent changes to the methods through which we present information from the sector to the public. This article discusses the potential for an exploration of the UK web archives for information about the history of archaeology online, and a case study undertaken as part of a Big Data project at the British Library by the author. The article concludes that we have a significant issue for media archaeologists in the future; the lack of material evidence for these iterations means we risk losing an understanding of our social, economic, cultural, and technological histories and our perception of these developments over time. 
It suggests that further exploration of these archives from an archaeological perspective could be beneficial both as an investigation of the iterations of digital archaeology (the creation of a history of public engagement with the subject), and as a study of the use of archaeological techniques for archival research. [ABSTRACT FROM AUTHOR] DA - 2015/08/29/ PY - 2015 DO - 10.1558/jca.v2i1.28244 VL - 2 IS - 1 SP - S28 EP - S32 SN - 2051-3429 UR - https://doi.org/10.1558/jca.v2i1.28244 L4 - http://www.equinoxpub.com/journals/index.php/JCA/article/view/28244 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - web archives KW - ARCHIVES KW - archaeology KW - ARCHAEOLOGY & history KW - ARCHIVAL resources KW - digital communications KW - digital data KW - DOCUMENTATION KW - media archaeology KW - SOCIOECONOMICS ER - TY - CONF TI - Understanding computational web archives research methods using research objects AU - Maemura, Emily AU - Becker, Christoph AU - Milligan, Ian AB - Use of computational methods for exploration and analysis of web archives sources is emerging in new disciplines such as digital humanities. This raises urgent questions about how such research projects process web archival material using computational methods to construct their findings. This paper aims to enable web archives scholars to document their practices systematically to improve the transparency of their methods. We adopt the Research Object framework to characterize three case studies that use computational methods to analyze web archives within digital history research. We then discuss how the framework can support the characterization of research methods and serve as a basis for discussions of methods and issues such as reuse and provenance. 
The results suggest that the framework provides an effective conceptual perspective for describing and analyzing, at a high level, the computational methods used in web archive research, and for making transparent the choices made in the process. The documentation of the research process contributes to a better understanding of the findings and their provenance, and to the possible reuse of data, methods, and workflows. C3 - 2016 IEEE International Conference on Big Data (Big Data) DA - 2016/12// PY - 2016 DO - 10.1109/BigData.2016.7840982 SP - 3250 EP - 3259 PB - IEEE SN - 978-1-4673-9005-7 UR - http://ieeexplore.ieee.org/document/7840982/ KW - computational methods KW - web archives KW - digital curation KW - computational archival science KW - research objects ER - TY - JOUR TI - Preserving Social Media: The Problem of Access. AU - Thomson, Sara Day AU - Kilbride, William T2 - New Review of Information Networking AB - This article is part of a 12-month study commissioned by the UK Data Service as part of the “Big Data Network” program funded by the Economic and Social Research Council (ESRC). The larger study focuses on the potential uses and accompanying challenges of data generated by social networking applications. 
This article, “Preserving Social Media: The Problem of Access,” comprises an excerpt from that longer study, allowing the authors a space to explore in closer detail the issue of making social media archives accessible to researchers and students now and in the future. © Sara Day Thomson and William Kilbride [ABSTRACT FROM AUTHOR] DA - 2015/05// PY - 2015 VL - 20 IS - 1/2 SP - 261 EP - 275 SN - 13614576 UR - https://doi.org/10.1080/13614576.2015.1114842 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - digital preservation KW - DIGITIZATION of archival materials KW - web archives KW - access restrictions KW - data-driven research KW - SOCIAL media addiction KW - social media preservation KW - STATISTICAL decision making ER - TY - JOUR TI - Digital Contemporary History Sources, Tools, Methods, Issues AU - Webster, Peter T2 - Temp - tidsskrift for historie AB - Digital contemporary history: sources, tools, methods, issues. This essay suggests that there has been a relative lack of digitally enabled historical research on the recent past, when compared to earlier periods of history. It explores why this might be the case, focussing in particular on both the obstacles and some missing drivers to mass digitisation of primary sources for the 20th century. It suggests that the situation is likely to change, and relatively soon, as a result of the increasing availability of sources that were born digital, and of Web archives in particular. The article ends with some reflections on several shifts in method and approach which that changed situation is likely to entail. 
DA - 2017/// PY - 2017 VL - 7 IS - 14 SP - 30 EP - 38 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - https://tidsskrift.dk/temp/article/view/96386/145232 KW - web archives KW - digital history KW - digital research KW - digital sources KW - information society ER - TY - JOUR TI - If these crawls could talk: Studying and documenting web archives provenance AU - Maemura, Emily AU - Worby, Nicholas AU - Milligan, Ian AU - Becker, Christoph T2 - Journal of the Association for Information Science and Technology AB - The increasing use and prominence of web archives raises the urgency of establishing mechanisms for transparency in the making of web archives to facilitate the process of evaluating a web archive’s provenance, scoping, and absences. Some choices and process events are captured automatically, but their interactions are not currently well understood or documented. This study examined the decision space of web archives and its role in shaping what is and what is not captured in the web archiving process. By comparing how three different web archives collections were created and documented, we investigate how curatorial decisions interact with technical and external factors and we compare commonalities and differences. The findings reveal the need to understand both the social and technical context that shapes those decisions and the ways in which these individual decisions interact. Based on the study, we propose a framework for documenting key dimensions of a collection that addresses the situated nature of the organizational context, technical specificities, and unique characteristics of web materials that are the focus of a collection. The framework enables future researchers to undertake empirical work studying the process of creating web archives collections in different contexts. 
DA - 2018/10// PY - 2018 DO - 10.1002/asi.24048 VL - 69 IS - 10 SP - 1223 EP - 1233 SN - 23301635 UR - http://doi.wiley.com/10.1002/asi.24048 ER - TY - JOUR TI - Webarchiválás és a történeti kutatások AU - Kokas, Károly AU - Drótos, László T2 - Digitális Bölcsészet AB - Born-digital content is a much more detailed and complete image of the present than what could be recorded in earlier ages with traditional information carriers. The first part of the study gives an overview of the attempts and technologies that exist for preserving this digital present, and of the limitations of the web archives already in operation. The second part examines how research with a historical perspective can profit from all this, and how it will become a first-rate source, above all, for the history of the recent past. The authors also point out that the huge data silos produced by web harvesting will demand entirely new kinds of source handling and methodology, while holding out the promise that entirely new kinds of results can be achieved with their help. DA - 2018/07/16/ PY - 2018 DO - 10.31400/dh-hun.2018.1.129 VL - 1 IS - 1 SP - 35 EP - 55 SN - 2630-9696 UR - http://ojs.elte.hu/index.php/digitalisbolcseszet/article/view/129 KW - digital humanities KW - digital preservation KW - web archiving KW - web historiography ER - TY - CONF TI - Social Media Collecting at the National Library of New Zealand AU - Macnaught, Bill AB - Collecting content from the internet is an increasingly significant part of collection building at the National Library of New Zealand. Social media collecting is a new aspect of our digital collecting. We currently collect social media both under our legal deposit legislation and through donation as part of personal papers or archives. Social media offers unique content and voices, not always available in other formats. While this gives us new opportunities to diversify our collections, it isn’t without challenges. 
Content is shifting away from traditional websites to social media. This is understandable: it is easier to post content, quicker to circulate, and cheaper. However, it also comes with new collecting challenges. C1 - Kuala Lumpur C3 - IFLA WLIC 2018 – Kuala Lumpur, Malaysia – Transform Libraries, Transform Societies Session 93 - National Libraries and Social Media - Meeting the Challenges of Acquiring, Preserving and Proving Long-Term Access - National Libraries DA - 2018/// PY - 2018 PB - IFLA UR - http://library.ifla.org/id/eprint/2274 Y2 - 2018/09/10/ L4 - http://library.ifla.org/2274/1/093-macnaught-en.pdf KW - Collection development KW - Digital collecting KW - Legal Deposit KW - legislation. KW - Social Media collecting KW - Social ER - TY - CHAP TI - Detecting Off-Topic Pages in Web Archives AU - AlNoamany, Yasmin AU - Weigle, Michele C AU - Nelson, Michael L T2 - Research and Advanced Technology for Digital Libraries. TPDL 2015. Lecture Notes in Computer Science, vol 9316. A2 - Kapidakis, Sarantos A2 - Mazurek, Cezary A2 - Werla, Marcin AB - Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy is a precondition for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of Web archive captures. In this paper, we address the problem of detecting off-topic pages in Web archive collections. We evaluate six different methods to detect when a page has gone off-topic through subsequent captures. The predicted off-topic pages will be presented to the collection's curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three Archive-It collections to evaluate the proposed methods at different thresholds. 
We found that combining cosine similarity at a threshold of 0.10 with change in size measured by word count at a threshold of -0.85 performs best, with accuracy = 0.987, F1 score = 0.906, and AUC = 0.968. We evaluated the performance of the proposed method on several Archive-It collections. The average precision of detecting the off-topic pages is 0.92. CY - Cham DA - 2015/01// PY - 2015 SP - 225 EP - 237 PB - Springer International Publishing SN - 978-3-319-24592-8 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L1 - http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-off-topic.pdf L4 - http://link.springer.com/chapter/10.1007/978-3-319-24592-8_17 KW - Web archiving KW - Internet Archive KW - Archived collections KW - Document filtering KW - Document similarity KW - Information retrieval KW - Web content mining ER - TY - CONF TI - No More 404s AU - Zhou, Ke AU - Grover, Claire AU - Klein, Martin AU - Tobin, Richard AB - The citation of resources is a fundamental part of scholarly discourse. Due to the popularity of the web, there is an increasing trend for scholarly articles to reference web resources (e.g. software, data). However, due to the dynamic nature of the web, the referenced links may become inaccessible ('rotten') sometime after publication, returning a "404 Not Found" HTTP error. In this paper we first present some preliminary findings of a study of the persistence and availability of web resources referenced from papers in a large-scale scholarly repository. We reaffirm previous research that link rot is a serious problem in the scholarly world and that current web archives do not always preserve all rotten links. Therefore, a more pro-active archival solution needs to be developed to further preserve web content referenced in scholarly articles. 
To this end, we propose to apply machine learning techniques to train a link rot predictor for use by an archival framework to prioritise pro-active archiving of links that are more likely to be rotten. We demonstrate that we can obtain a fairly high link rot prediction AUC (0.72) with only a small set of features. By simulation, we also show that our prediction framework is more effective than current web archives at preserving links that are likely to be rotten. This work has potential impact for the scholarly world, where publishers can utilise this framework to prioritise the archiving of links for digital preservation, especially when there is a large quantity of links to be archived. C1 - New York, New York, USA C3 - Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '15 DA - 2015/// PY - 2015 DO - 10.1145/2756406.2756940 SP - 233 EP - 236 PB - ACM Press SN - 978-1-4503-3594-2 UR - http://dl.acm.org/citation.cfm?doid=2756406.2756940 L4 - http://doi.acm.org/10.1145/2756406.2756940 KW - digital preservation KW - repositories KW - web persistence ER - TY - CHAP TI - Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives AU - Souza, Tarcisio AU - Demidova, Elena AU - Risse, Thomas AU - Holzmann, Helge AU - Gossen, Gerhard AU - Szymanski, Julian T2 - IKC 2015: Semantic Keyword-based Search on Structured Data Sources A2 - Cardoso, Jorge A2 - Guerra, Francesco A2 - Houben, Geert-Jan A2 - Pinto, Alexandre Miguel A2 - Velegrakis, Yannis AB - Long-term Web archives comprise Web documents gathered over longer time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, the annotation of the entire archive content on this scale is often infeasible. 
The most efficient way to access the documents within Web archives is provided through their URLs, which are typically stored in dedicated index files. The URLs of the archived Web documents can contain semantic information and can offer an efficient way to obtain initial semantic annotations for the archived documents. In this paper, we analyse the applicability of semantic analysis techniques such as named entity extraction to the URLs in a Web archive. We evaluate the precision of the named entity extraction from the URLs in the Popular German Web dataset and analyse the proportion of the archived URLs from 1,444 popular domains in the time interval from 2000 to 2012 to which these techniques are applicable. Our results demonstrate that named entity recognition can be successfully applied to a large number of URLs in our Web archive and provide a good starting point to efficiently annotate large scale collections of Web documents. CY - Cham DA - 2015/// PY - 2015 SP - 153 EP - 166 PB - Springer International Publishing SN - 978-3-319-27932-9 UR - http://link.springer.com/10.1007/978-3-319-27932-9_14 ER - TY - CHAP TI - A Proposal of a Big Web Data Application and Archive for the Distributed Data Processing with Apache Hadoop AU - Lnenicka, Martin AU - Hovad, Jan AU - Komarkova, Jitka T2 - Computational Collective Intelligence. Lecture Notes in Computer Science, vol 9330 A2 - Núñez, Manuel A2 - Nguyen, Ngoc Thanh A2 - Camacho, David A2 - Trawiński, Bogdan AB - In recent years, research on big data, data storage and other topics that represent innovations in the analytics field has become very popular. This paper describes a proposal of a big web data application and archive for the distributed data processing with Apache Hadoop, including the framework with selected methods, which can be used with this platform. 
It proposes a workflow to create a web content mining application and a big data archive, using modern technologies such as Python, PHP, JavaScript, MySQL, and cloud services. It also gives an overview of the architecture, methods, and data structures used in the context of web mining, distributed processing, and big data analytics. CY - Cham DA - 2015/// PY - 2015 SP - 285 EP - 294 PB - Springer International Publishing SN - 978-3-319-24306-1 UR - http://link.springer.com/10.1007/978-3-319-24306-1_28 KW - Apache Hadoop KW - Web content mining KW - Big data analytics KW - Big web data KW - Distributed data processing KW - Python ER - TY - CHAP TI - The Influence of Client Platform on Web Page Content: Measurements, Analysis, and Implications AU - Sanders, Sean AU - Sanka, Gautam AU - Aikat, Jay AU - Kaur, Jasleen T2 - WISE 2015: Web Information Systems Engineering – WISE 2015 A2 - Wang, Jianyong A2 - Cellary, Wojciech A2 - Wang, Dingding A2 - Wang, Hua A2 - Chen, Shu-Ching A2 - Li, Tao A2 - Zhang, Yanchun AB - Modern web users have access to a wide and diverse range of client platforms to browse the web. While it is anecdotally believed that the same URL may result in a different web page across different client platforms, the extent to which this occurs is not known. In this work, we systematically study the impact of different client platforms (browsers, operating systems, devices, and vantage points) on the content of base HTML pages. We collect and analyze the base HTML page downloaded for 3876 web pages drawn from the top 250 web sites, using 32 different client platforms over a period of 30 days; our dataset includes over 3.5 million web page downloads. We find that client platforms have a statistically significant influence on web page downloads in both expected and unexpected ways. 
We discuss the impact that these results will have in several application domains including web archiving, user experience, social interactions and information sharing, and web content sentiment analysis. CY - Cham DA - 2015/// PY - 2015 SP - 1 EP - 16 PB - Springer International Publishing SN - 978-3-319-26187-4 UR - http://link.springer.com/10.1007/978-3-319-26187-4_1 ER - TY - CHAP TI - Web Content Management Systems Archivability AU - Banos, Vangelis AU - Manolopoulos, Yannis T2 - Advances in Databases and Information Systems. ADBIS 2015 A2 - Tadeusz, Morzy A2 - Valduriez, Patrick A2 - Bellatreche, Ladjel AB - Web archiving is the process of collecting and preserving web content in an archive for current and future generations. One of the key issues in web archiving is that not all websites can be archived correctly due to various issues that arise from the use of different technologies, standards and implementation practices. Nevertheless, one of the common denominators of current websites is that they are implemented using a Web Content Management System (WCMS). We evaluate the Website Archivability (WA) of the most prevalent WCMSs. We investigate the extent to which each WCMS meets the conditions for a safe transfer of their content to a web archive for preservation purposes, and thus identify their strengths and weaknesses. More importantly, we deduce specific recommendations to improve the WA of each WCMS, aiming to advance the general practice of web data extraction and archiving. 
CY - Cham DA - 2015/// PY - 2015 SP - 198 EP - 212 PB - Springer International Publishing SN - 978-3-319-23135-8 UR - http://link.springer.com/10.1007/978-3-319-23135-8_14 ER - TY - CHAP TI - Web Archive Profiling Through Fulltext Search AU - Alam, Sawood AU - Nelson, Michael L AU - Van de Sompel, Herbert AU - Rosenthal, David S H T2 - TPDL 2016: Research and Advanced Technology for Digital Libraries A2 - Fuhr, Norbert A2 - Kovács, László A2 - Risse, Thomas A2 - Nejdl, Wolfgang AB - An archive profile is a high-level summary of a web archive’s holdings that can be used for routing Memento queries to the appropriate archives. It can be created by generating summaries from the CDX files (index of web archives) which we explored in an earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means to discover the holdings of an archive involve sampling based approaches such as fulltext keyword searching to learn the URIs present in the response or looking up for a sample set of URIs and see which of those are present in the archive. It is the fulltext search based discovery and profiling that is the scope of this paper. We developed the Random Searcher Model (RSM) to discover the holdings of an archive by a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can make routing decisions of 80 % of the requests correctly while maintaining about 0.9 recall by discovering only 10 % of the archive holdings and generating a profile that costs less than 1 % of the complete knowledge profile. 
CY - Cham DA - 2016/// PY - 2016 SP - 121 EP - 132 PB - Springer International Publishing SN - 978-3-319-43997-6 UR - http://link.springer.com/10.1007/978-3-319-43997-6_10 KW - Memento KW - Archive profiling KW - Random searcher KW - Web archive ER - TY - CHAP TI - Ontology-Based Automatic Annotation: An Approach for Efficient Retrieval of Semantic Results of Web Documents AU - Lakshmi Tulasi, R AU - Rao, Meda Sreenivasa AU - Ankita, K AU - Hgoudar, R T2 - Proceedings of the First International Conference on Computational Intelligence and Informatics A2 - Satapathy, Suresh Chandra A2 - Prasad, V Kamakshi A2 - Rani, B Padmaja A2 - Udgata, Siba K A2 - Raju, K Srujan AB - The Web contains a large amount of unstructured data, which yields irrelevant as well as relevant results. To remove the irrelevance from the results, a methodology is defined that retrieves semantic information. Semantic search deals directly with a domain-specific knowledge base. Everyone constructs an ontology knowledge base in their own way, which results in heterogeneity among ontologies. The problem of heterogeneity can be resolved by applying an ontology mapping algorithm. All the documents are collected from the Web by a Web crawler and a document base is created. The documents are then given as input for semantic annotation against the updated ontology. The results for a user's query are retrieved from the semantic information retrieval system after a search algorithm is applied. The experiments conducted with this methodology show that the results thus obtained provide more accurate and precise information. 
CY - Singapore DA - 2017/// PY - 2017 SP - 331 EP - 339 PB - Springer Singapore SN - 978-981-10-2471-9 UR - http://link.springer.com/10.1007/978-981-10-2471-9_32 ER - TY - CHAP TI - A Topic Transition Map for Query Expansion: A Semantic Analysis of Click-Through Data and Test Collections AU - Kim, Kyung-min AU - Jung, Yuchul AU - Myaeng, Sung-Hyon T2 - AI 2016: Advances in Artificial Intelligence A2 - Kang, Byeong Ho A2 - Bai, Quan AB - Term mismatching between queries and documents has long been recognized as a key problem in information retrieval (IR). Based on our analysis of a large-scale web query log and relevant documents in standard test collections, we attempt to detect topic transitions between the topical categories of a query and those of relevant documents (or clicked pages) and create a Topic Transition Map (TTM) that captures how query topic categories are linked to those of relevant or clicked documents. TTM, a kind of click-graph at the semantic level, is then used for query expansion by suggesting the terms associated with the document categories strongly related to the query category. Unlike most other query expansion methods that attempt to either interpret the semantics of queries based on a thesaurus-like resource or use the content of a small number of relevant documents, our method proposes to retrieve documents in the semantic affinity of multiple categories of the documents relevant for the queries of a similar kind. Our experiments show that the proposed method is superior in effectiveness to other representative query expansion methods such as standard relevance feedback, pseudo relevance feedback, and thesaurus-based expansion of queries. 
CY - Cham DA - 2016/// PY - 2016 SP - 648 EP - 664 PB - Springer International Publishing SN - 978-3-319-50127-7 UR - http://link.springer.com/10.1007/978-3-319-50127-7_57 KW - Query expansion KW - Relevance feedback KW - Semantic categorization of terms KW - Topic Transition Map ER - TY - CHAP TI - Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation AU - Sanoja, Andrés AU - Gançarski, Stéphane T2 - ADBIS 2017: Advances in Databases and Information Systems A2 - Kirikova, Mārīte A2 - Nørvåg, Kjetil A2 - Papadopoulos, George A AB - Web archives (and the Web itself) are likely to suffer from format obsolescence. In a few years or decades, future Web browsers will no longer be able to properly render Web pages written in the HTML4 format. We therefore propose a migration tool from HTML4 to HTML5. This is challenging, because it requires generating HTML5 semantic elements that do not exist in HTML4 pages. To solve this issue, we propose to use a Web page segmenter. Indeed, blocks generated by a segmenter are good candidates for being semantic elements, as both reflect the content structure of the page. We use an evaluation framework for Web page segmentation that helps define and compute relevant metrics to measure the quality of the migration process. We ran experiments on a sample of 40 pages. The migrated pages we produce are compared to a ground truth. The automatic labeling of blocks is quite similar to the ground truth, though its quality depends on the type of page we migrate. When comparing the rendering of the original page and the rendering of its migrated version, we note some differences, mainly due to the fact that rendering engines do not (yet) properly render the content of semantic elements. 
CY - Cham DA - 2017/// PY - 2017 SP - 375 EP - 393 PB - Springer International Publishing SN - 978-3-319-66917-5 UR - http://link.springer.com/10.1007/978-3-319-66917-5_25 KW - archive KW - Blocks KW - Format obsolescence KW - HTML5 KW - Migration KW - Segmentation KW - Web ER - TY - JOUR TI - A cseh web és a kötelespéldány-rendelet AU - Celbová, Ludmila AU - Prókai, Margit T2 - Könyvtári figyelő : külföldi lapszemle AB - In the Czech Republic there is no legal regulation on the legal deposit of electronic publications. The national library has been engaged in web archiving since 2000, but the problem is not clear-cut at either the national or the international level. It touches on the copyright act, the regulation of printed legal deposit copies, and the provisions of the library act. DA - 2009/// PY - 2009 IS - 3 SP - 518 EP - 520 L4 - http://epa.oszk.hu/00100/00143/00072/pdf/2009_3%20szam_referatumok.pdf#page=14 KW - article abstract KW - library affairs KW - Electronic publication KW - Accessibility KW - Legislation KW - Legal deposit KW - Preservation ER - TY - JOUR TI - Crook, Edgar: Webarchiválás a webkettes világban AU - Drótos, László AU - Crook, Edgar T2 - Tudományos és műszaki tájékoztatás AB - The National Library of Australia has played a leading role in collecting and preserving the Australian web since 1996, when the PANDORA archive (pandora.nla.gov.au) was established. There are also other, narrower projects, such as Tasmania's Our Digital Island (odi.statelibrary.tas.gov.au) or Territory Stories (territorystories.nt.gov.au) in the Northern Territory. The national library now archives in three ways: it selects online resources for the PANDORA collection, it harvests the entire .au domain with the help of the Internet Archive, and it has also begun to use the Archive-It service. A significant part of Australian online content is thus being saved for the future. 
But because of technological change the library must adapt continuously: improving its archiving tools, broadening the range of content it collects, and allying with new partners, so that it can carry on this important work successfully. DA - 2010/// PY - 2010 VL - 57 IS - 2 SP - 78 EP - 81 SN - 0041-3917 L4 - http://epa.oszk.hu/03000/03071/00029/pdf/EPA03071_tmt_2010_02_068-081.pdf#page=11 KW - web archiving KW - Australia ER - TY - JOUR TI - 404 Not Found - Ki őrzi meg az internetet; Webarchiválás workshop az Országos Széchényi Könyvtárban AU - Németh, Márton T2 - Tudományos és műszaki tájékoztatás AB - On 13 October 2017 the National Széchényi Library (Országos Széchényi Könyvtár, OSZK) hosted for the first time an event devoted specifically to archiving the World Wide Web. As one of the working groups of the National Library System (OKR) project, run jointly by the institution and the Government IT Agency (KIFÜ), we were able to begin work on web archiving in a pilot project this spring. (More information can be found at http://mekosztaly.oszk.hu/mia.) Our goal is to arrive, by the end of the project period, at a concept that, following the example of numerous European national libraries, enables OSZK to carry out and organize web archiving as a routine workflow. We wish to create a system that, in addition to the long-term preservation of cultural heritage, can also serve the needs of education, scientific research, state bodies, the business sphere, and individual internet users. With the creation of the archive, the Hungarian internet, which now exists only in the present tense, would also gain a "past", and possibilities would open up for its present and future users that are currently impossible or difficult to realize (e.g. finding defunct websites, analysing and visualizing how websites change over time, stable citability, running text and data mining applications that include a time dimension, research into the history of the internet, providing authentic copies). Our event took place as a closing of sorts to the project's first half-year. In compiling the programme we took particular care to present existing professional experience from abroad as well as the domestic antecedents. A further emphasized aim of the workshop was that the entire public collections sphere ( DA - 2017/// PY - 2017 VL - 64 IS - 11 SP - 577 EP - 582 SN - 0041-3917 L4 - http://epa.oszk.hu/03000/03071/00112/pdf/EPA03071_tmt_2017_11_577-582.pdf KW - web archiving KW - 404 workshop KW - MIA pilot project KW - Országos Széchényi Könyvtár ER - TY - JOUR TI - Az első nyilvános webarchívum az Egyesült Királyságban AU - Bailey, Steve AU - Thomson, Dale AU - Szalóki, Gabriella T2 - Tudományos és műszaki tájékoztatás AB - For many people the web is the primary source of information, yet little attention has so far been paid to the long-term preservation of web pages, which carries the risk that invaluable scholarly and cultural assets will be lost to future generations. To address the problem, six leading British institutions are working together on a test environment on the basis of which the websites to be archived can be selected. The six institutions (the UK National Archives, the British Library, the Joint Information Systems Committee (JISC), the national libraries of Scotland and Wales, and the Wellcome Library) have formed the UK Web Archiving Consortium (UKWAC). For archiving they use the PANDAS (PANDORA Digital Archival System) software developed by the National Library of Australia. The partners archive the pages related to their own institution's field. 
DA - 2006/// PY - 2006 VL - 53 IS - 10 UR - http://tmt-archive.omikk.bme.hu/show_news.html?id=4555&issue_id=476 KW - web archiving KW - article review KW - Great Britain ER - TY - JOUR TI - Web-archívum made in Slovakia: Kísérleti projekt az elektronikus információforrások gyűjtésére és archiválására A web mint kulturális örökség AU - Androvič, Ing. Alojz AU - Prókai, Margit T2 - Tudományos és műszaki tájékoztatás AB - Within the European CULTURE 2000 programme, the Czech National Library and the University Library in Bratislava (PEK) undertook to work out methods and criteria for archiving the web. At the same time, a qualitative and quantitative survey of the web also began in Slovakia, mapping the websites under the Slovak national domain. According to data from May 2006, some 46,961 users had registered a total of 92,000 domain names under the Slovak national domain, .sk. DA - 2007/// PY - 2007 VL - 54 IS - 10 UR - http://tmt-archive.omikk.bme.hu/show_news.html?id=4788&issue_id=487 KW - web archiving KW - article review KW - Slovakia ER - TY - JOUR TI - Az internet archiválása mint könyvtári feladat AU - Drótos, László T2 - Tudományos és műszaki tájékoztatás AB - The documents and other information sources that are deleted from the public internet or moved elsewhere in huge numbers every day pose a growing problem for citability in scholarly publications and teaching materials, and the average internet user, too, constantly runs into 404 errors marking vanished web pages. The web is fundamentally a present-tense medium, but at least part of it would be worth preserving and making researchable for future generations. This article seeks answers to the questions of who should save what from the internet, how, with which tools, and why, and where the task and responsibility of libraries and librarians lie.
It presents several useful tools and services, then briefly reviews the international situation and the pilot web-archiving project launched at OSZK in the spring of 2017. DA - 2017/// PY - 2017 VL - 64 IS - 7-8 SP - 361 EP - 371 SN - 0041-3917 L4 - http://epa.oszk.hu/03000/03071/00109/pdf/EPA03071_tmt_2017_07_08_361-371.pdf KW - internet KW - archiving KW - site map KW - OSZK ER - TY - JOUR TI - Az OSZK-ban folyó kísérleti webarchiválási projekt első évének tapasztalatai AU - Drótos, László AU - Németh, Márton T2 - Tudományos és műszaki tájékoztatás AB - As part of the development of the OKR (National Library System), a pilot project is running at the National Széchényi Library in 2017-2018 with the aim of creating in Hungary, too, the conditions for the mass archiving and long-term preservation of public websites, above all the IT infrastructure and expertise this work requires. In this field we have more than 20 years of catching up to do: the American non-profit organisation the Internet Archive (IA), for example, has been engaged in this since 1996, and its example has since been followed in many countries, where national, governmental, or institutional web archives have been established, often under the direction of, or with the participation of, libraries and archives. The idea of a Hungarian Internet Archive (MIA) arose at OSZK in the mid-2000s, but the conditions for the preparatory work only began to take shape in the spring of 2017. In the opening talk of the workshop held on the first day of the Networkshop conference in Eger, we reported on the developments of the year up to April 2018, and in this article we summarise those results and experiences.
DA - 2018/// PY - 2018 VL - 65 IS - 7-8 SP - 389 EP - 400 SN - 0041-3917 UR - http://tmt.omikk.bme.hu/tmt/article/view/7153/8156 ER - TY - JOUR TI - A webarchiválásról történeti megközelítésben AU - Németh, Márton T2 - Könyv, könyvtár, könyvtáros AB - Through case studies, this edited volume is among the first attempts to illuminate many important elements of the historical and broader social-science context of web archiving. As the editors' preface notes, overviews to date have tended to deal with the web-archiving process itself, its technical details, and the curatorial activities connected with archiving the web. Drawing on the most recent literature, the editors' preface here again sketches the broader context of web archiving, the chronology of its history so far, and its main actors. They characterise the principal institutions carrying out these activities. Alongside the pioneering role of the Internet Archive, they point out that while in some countries the activity clusters around a single leading institution (for example in Denmark), elsewhere there is inter-institutional coordination with clearly separated roles (e.g. France, Great Britain). We are briefly informed about the software used to harvest the materials and the methods by which retrieval of the stored web information can be ensured (e.g. the Wayback Machine software developed by the Internet Archive supports searching by URL, complemented by a full-text index service in those collections where one happens to be available).
DA - 2018/// PY - 2018 VL - 27 IS - 2 SP - 48 EP - 52 SN - 1216-6804 UR - http://ki2.oszk.hu/3k/2018/06/a-webarchivalasrol-torteneti-megkozelitesben/ L1 - https://epa.oszk.hu/01300/01367/00299/pdf/EPA01367_3K_2018_02_048-052.pdf KW - data science KW - book review KW - web history ER - TY - JOUR TI - Webarchiválási politikák AU - Dancs, Szabolcs T2 - Könyv, könyvtár, könyvtáros AB - MatarkaID=1671000 DA - 2011/// PY - 2011 VL - 20 IS - 10 SP - 14 EP - 20 SN - 1216-6804 L4 - http://epa.oszk.hu/01300/01367/00248/pdf/EPA01367_3K_2011_10_14-20.pdf KW - web archiving ER - TY - JOUR TI - Nemzetközi körkép a webarchiválás gyakorlatáról AU - Németh, Márton T2 - Könyvtári figyelő AB - Web archiving is a dynamically developing field that has already surfaced in many respects in the pages of Könyvtári Figyelő, particularly in reviews of the international literature. (In 2014, for example, Ilona Hegyközi surveyed international trends in web archiving.) We felt the time had come for a fresh summary. This is given particular weight by the fact that, after numerous earlier initiatives, the foundations were laid this spring, within OSZK's development project, for launching a pilot project in which we assess the hardware and software requirements of web archiving as well as the professional knowledge needed. The main goal is to integrate this field, on a well-founded basis and for the long term, into OSZK's service activities. At OSZK's Department of E-Library Services we have created a Hungarian Internet Archive website (http://mekosztaly.oszk.hu/mia), where the various methods and basic concepts of web archiving, as well as the international literature, can be studied. We also provide up-to-date information on the project there, and one can subscribe to a mailing list devoted to the professional questions of web archiving.
The aim of this article, then, is not to go over the professional foundations of web-archiving activities (which browsing the website makes possible), but to give an overview of the international good practices on which web-archiving services are built. DA - 2017/// PY - 2017 VL - 63 IS - 4 SP - 575 EP - 582 SN - 0023-3773 L4 - http://epa.oszk.hu/00100/00143/00349/pdf/EPA00143_konyvtari_figyelo_2017_04_575-582.pdf KW - web archiving KW - international survey ER - TY - JOUR TI - Webtörténetírás az Internet Archive-ból készített képernyővideókkal AU - Drótos, László T2 - Tudományos és műszaki tájékoztatás AB - The global Internet Archive and the national web archives are the principal sources for research into digital history, since they collect and preserve born-digital culture and thus contain content that can be studied nowhere else. From the second half of the 1990s onward it is inconceivable to write a complete history of anything relying solely on printed newspapers and books while ignoring the subject's traces on the internet. DA - 2017/// PY - 2017 VL - 64 IS - 7-8 SP - 397 EP - 401 SN - 0041-3917 L4 - http://epa.oszk.hu/03000/03071/00109/pdf/EPA03071_tmt_2017_07_08_397-401.pdf KW - Internet Archive KW - screen video KW - web historiography ER - TY - CONF TI - Passing on the Lessons of the Great East Japan Earthquake to Future Generations—The National Diet Library Great East Japan Earthquake Archive AU - INOUE, Sachiko AB - In the aftermath of the Great East Japan Earthquake, which struck on March 11, 2011, the Japanese government recognized an urgent need to create a national archive of information about this unprecedented natural disaster, so that the lessons learned from this experience would not be lost.
Having an obligation as a national library to collect, preserve, and share materials that record all aspects of Japan’s cultural heritage, the National Diet Library (NDL), in cooperation with other Japanese government agencies, has responded to this need by creating a portal site, called HINAGIKU, through which researchers can search and access a wide variety of earthquake archives. In this paper, I will report on our achievements as well as the challenges we face in configuring HINAGIKU to facilitate access to documentation published or archived primarily by the national and municipal government agencies. At present, HINAGIKU enables access to materials documenting both past experience and current disaster prevention planning via an integrated search functionality of multiple digital archives established by municipal governments, academic institutions, the Ministry of Internal Affairs and Communications, and other organizations as well as the NDL. Visitors to HINAGIKU are able to search records stored at the NDL and other institutions, and new knowledge generated from such research can also be integrated into HINAGIKU as new content. Over time, as interest in earthquake-related materials decreases, it becomes imperative that the NDL acquire and preserve these materials before such archives disappear. The NDL also has a role to play in handing down these most valuable records to future generations by managing issues related to copyright, personality rights, and secondary use, thereby making HINAGIKU even more useful. C1 - Kuala Lumpur C3 - IFLA WLIC 2018 – Kuala Lumpur, Malaysia – Transform Libraries, Transform Societies in Session 233 - Government Information and Official Publications. 
DA - 2018/// PY - 2018 PB - IFLA UR - http://library.ifla.org/id/eprint/2217 KW - disaster archive KW - Great East Japan Earthquake Archive KW - metadata KW - portal site KW - rights handling ER - TY - CONF TI - Web archiving issues and challenges in State Government of Sarawak (Malaysia): Do they really need their website to be archived? AU - Jassalini Jamain AU - Yahya, Ayu Lestari AU - Muhammad, Natalia AU - Musa Ayob Abdul Rahman A2 - IFLA AB - The Sarawak State Web Archive (SSWA) is an initiative of the Sarawak State Library (Pustaka). Website contents of Sarawak State Civil Service (SSCS) entities, obtained from the World Wide Web (WWW), are archived for the purpose of preserving non-library resources, as part of the Legal Deposit requirements of the Sarawak State Library Ordinance, 1999. Web preservation is considered a common practice at the international level, whereas in Malaysia it is still at a minimal level. Since 2009, Pustaka has been harvesting 132 websites of Sarawak State Government departments and agencies. However, Pustaka has faced challenges in performing web archiving work. This paper focuses on the general issues and challenges in preserving corporate information heritage to make it available for future reference. C1 - Kuala Lumpur C3 - IFLA WLIC 2018 – Kuala Lumpur, Malaysia – Transform Libraries, Transform Societies in Session 160 - Preservation and Conservation with Information Technology. DA - 2018/// PY - 2018 PB - IFLA -- International Federation of Library Associations and Institutions UR - http://library.ifla.org/id/eprint/2115 KW - Web archiving KW - corporate information heritage KW - legal deposit KW - Malaysia KW - Sarawak ER - TY - CONF TI - Preserving cultural heritage: Better together! AU - Signori, Barbara AB - The Swiss National Library has a mandate to collect, catalogue, store and disseminate the cultural heritage created in Switzerland and abroad by and about the Swiss, both in print and in digital form.
This sounds like a clear enough mission, but dig deeper and this mandate raises all sorts of tough questions. What exactly is cultural heritage? Obviously, it goes far beyond e-books and e-journals of well-established Swiss publishers. It is Swiss websites, newsletters of Swiss societies, and so on. However, what about all the digital data that is created by Swiss people every waking moment? The selfies, blogs, tweets, social media, personal digital archives. Surely not everything can be considered cultural heritage. But who decides what is and what isn’t? And then how do we cope with the enormous quantity of information being produced? How can we decide what to keep for future generations when we cannot even cope with the output of the current generation? Not to mention the costs. With budgets being cut all the time, what does that mean for our cultural heritage? Amidst all these tough questions, one thing is clear: no single institution can possibly cope with collecting all that information nor be tasked with the decision on what to preserve and what not. This paper will use the example of Web Archive Switzerland to show how trust and interoperability have led to constructive collaboration. Web Archive Switzerland was born in 2008 following 5 years of discussion with the cantonal libraries. Since then websites with a bearing on Switzerland have been selected, documented, preserved and disseminated collaboratively among 30 Swiss institutions. The key lesson learned over the past 14 years is that to answer the tough questions and challenges we had to look beyond our own walls and borders. We learned to let go of the idea that we can do it alone, that we can control the world of content through clever curation. We learned how to create partnerships and strong networks of institutions, how to engage new sorts of curators, how to trust each other and share synergies and costs, all with the common goal of saving as much digital heritage as possible. 
In summary, this paper is a call to arms to join forces, to forge partnerships, to bundle competences, and to build collaborative networks! It will show that curating collaboration between institutions is as important as curating cultural heritage, and it will suggest ways forward to create more collaborative collections of digital cultural heritage within Switzerland and beyond. C1 - Wrocław C3 - IFLA WLIC 2017 – Wrocław, Poland – Libraries. Solidarity. Society. in Session S08 - Satellite Meeting: Preservation and Conservation Section joint with the Association International Francophone des Bibliothécaires et Documentalistes (AIFBD) in collaborati DA - 2017/// PY - 2017 PB - IFLA -- International Federation of Library Associations and Institutions UR - http://library.ifla.org/id/eprint/1801 KW - web archiving KW - collaboration KW - cultural heritage KW - preservation KW - Switzerland ER - TY - CONF TI - Here Today, Gone within a Month: The Fleeting Life of Digital News AU - Halbert, Martin AU - Skinner, Katherine AU - Wilson, Marc AU - Zarndt, Frederick AB - In 1989, on the shores of Montana's beautiful Flathead Lake, the owners of the weekly newspaper the Bigfork Eagle started TownNews.com to help community newspapers with developing technology. TownNews.com has since evolved into an integrated digital publishing and content management system used by more than 1600 newspaper, broadcast, magazine, and web-native publications in North America. TownNews.com is now headquartered on the banks of the mighty Mississippi River in Moline, Illinois. Not long ago Marc Wilson, CEO of TownNews.com, noticed that of the 220,000+ e-edition pages posted on behalf of its customers at the beginning of the month, 210,000 were deleted by month's end. What? The front-page story about a local business being sold to an international corporation that I read online September 1 will be gone by September 30?
As well as the story about my daughter's 1st-place finish in the district track and field meet? A 2014 national survey by the Reynolds Journalism Institute (RJI) of 70 digital-only and 406 hybrid (digital and print) newspapers conclusively showed that newspaper publishers also do not maintain archives of the content they produce. RJI found that a dismal 12% of the “hybrid” newspapers reported even backing up their digital news content, and fully 20% of the “digital-only” newspapers reported that they back up none of their content. Educopia Institute's 2012 and 2015 surveys with newspapers and libraries concur, and further demonstrate that the longstanding partner to the newspaper—the library—likewise is neither collecting nor preserving this digital content. This leaves us with a bitter irony: today one can find stories published prior to 1922 in the Library of Congress's Chronicling America and other digitized, out-of-copyright newspaper collections, but cannot, and never will be able to, read a story published online less than a month ago. In this paper we look at how much news is published online that is never published in print or on more permanent media. We estimate how much online news is or will soon be forever lost because no one preserves it: not publishers, not libraries, not content management systems, and not the Internet Archive. We delve into some of the reasons why this content is not yet preserved, and we examine the persistent challenges of digital preservation and of digital curation of this content type. We then suggest a pathway forward, via some initial steps that journalists, producers, legislators, libraries, distributors, and readers may each take to begin to rectify this historical loss going forward. C1 - Lexington, KY, USA C3 - IFLA WLIC 2016 – Columbus, OH – Connections. Collaboration. Community in Session S21 - Satellite Meeting: News Media.
In: News, new roles & preservation advocacy: moving libraries into action, 10-12 August 2016, Lexington, KY, USA. DA - 2016/// PY - 2016 PB - IFLA -- International Federation of Library Associations and Institutions UR - http://library.ifla.org/id/eprint/2077 KW - preservation KW - news KW - born digital news KW - e-edition KW - newspapers ER - TY - CONF TI - Preparing to Preserve: Three Essential Steps to Building Experience with Long-Term Digital Preservation AU - Lampert, Cory K. AU - Vaughan, Jason AB - Many organizations face complex questions of how to implement affordable and sustainable digital preservation practices. One strategic priority of the University Libraries at the University of Nevada-Las Vegas, United States, is an increased focus on preservation of unique digital assets, whether digitized from physical originals or born digital. A team composed of experts from multiple functional library departments (including the special collections/archives area and the technology area) was established to help address this priority, and its efforts are beginning to translate into operational practice. This work outlines a three-step approach (Partnership, Policy, Pilot) taken by one academic research library to strategically build experience using a collaborative team approach. Our experience included the formation of a team, the education of all members, and a foundational attitude that decisions would be undertaken as partners rather than as competing departments or units. The team's work included the development of an initial digital preservation policy, helping to distill the organizational priority and values associated with digital preservation. Several pilot projects were initiated and completed, which provided realistic, first-person experience with digital preservation activities, surfaced questions, and set the stage for developing and refining sustainable workflows.
This work will highlight key activities in our journey to date, with the hope that experience gained through this effort could be applicable, in whole or part, to other organizations regardless of their size or capacity. C1 - Kuala Lumpur C3 - IFLA WLIC 2018 – Kuala Lumpur, Malaysia – Transform Libraries, Transform Societies in Session 160 - Preservation and Conservation with Information Technology. DA - 2018/// PY - 2018 PB - IFLA -- International Federation of Library Associations and Institutions UR - http://library.ifla.org/id/eprint/2114 KW - archives KW - digital preservation KW - partnerships KW - policy KW - technology ER - TY - CONF TI - Where should the culture of our lives and memory be preserved? - Rethinking the role of the library. AU - Lee, Jaesun AB - The abilities to store and transfer memory, to learn from others’ experiences, or to share one’s knowledge with the world are the drivers of social development. This driving force derives from the library’s unique function and role to collect and service cultural assets. The National Library of Korea has recently expanded its scope of collection from printed media to online materials and broadcasting contents, and it opened its Memory Museum. The National Library of Korea has successfully demonstrated the example of a sustainable library in the new paradigm by strengthening its ability to preserve cultural memories. Meanwhile, public libraries in Korea have taken an initiative to preserve and transfer memory of a local community, which encourages locals’ participation and revitalizes community spirit that has disappeared as a result of rapid economic growth. In this paper, cases of integrating a museum’s archiving function into a library that led to social integration and community revitalization will be introduced; in addition, the paper will argue where and how the culture of our lives and memory should be preserved and utilized. C1 - Wrocław C3 - IFLA WLIC 2017 – Wrocław, Poland – Libraries. 
Solidarity. Society. in Session 189 - Asia and Oceania. DA - 2017/// PY - 2017 PB - IFLA -- International Federation of Library Associations and Institutions UR - http://library.ifla.org/1691/ L4 - http://library.ifla.org/id/eprint/1691 KW - community revitalization KW - cultural reproduction KW - Library service KW - memory archiving KW - social inclusion ER - TY - CONF TI - Born Digital Legal Deposit Policies and Practices AU - Zarndt, Frederick AU - Carner, Dorothy AU - McCain, Edward AB - In 2014, the authors surveyed the born digital content legal deposit policies and practices in 17 different countries and presented the results of the survey at the 2015 International News Media Conference hosted by the National Library of Sweden in Stockholm, Sweden, April 15-16, 2015. Three years later, the authors expanded their team and updated the survey in order to assess progress in creating or improving national policies and in implementing practices for preserving born digital content. The reach of the 2017 survey was broadened to include countries that did not participate in the 2014 survey. To optimise survey design, and allow for comparability of results with previous surveys, the authors briefly review 17 efforts over the last 12 years to understand the state of digital legal deposit and broader digital preservation policies (a deeper analysis will be provided in a future paper), and then set out the logic behind the current survey. C1 - Wrocław C3 - IFLA WLIC 2017 – Wrocław, Poland – Libraries. Solidarity. Society. in Session S18 - Satellite Meeting: News Media Section. DA - 2017/// PY - 2017 PB - IFLA -- International Federation of Library Associations and Institutions UR - http://library.ifla.org/1905/ KW - web archiving KW - digital preservation KW - E-legal deposit KW - survey ER - TY - CONF TI - Reconstruction of the US First Website AU - AlSum, Ahmed AB - The idea of the Web started in 1989 with a proposal from Sir Tim Berners-Lee.
The first US website was developed at SLAC in 1991. This early version of the Web and its subsequent updates until 1998 were preserved by the SLAC archives and history office for many years. In this paper, we discuss the strategy and techniques used to reconstruct this early website and make it available through the Stanford Web Archive Portal. C1 - New York, New York, USA C3 - Proceedings of the 15th ACM/IEEE-CE on Joint Conference on Digital Libraries - JCDL '15 DA - 2015/// PY - 2015 DO - 10.1145/2756406.2756954 SP - 285 EP - 286 PB - ACM Press SN - 978-1-4503-3594-2 UR - http://dl.acm.org/citation.cfm?doid=2756406.2756954 KW - Web Archiving ER - TY - CONF TI - Preserving digital legal deposit - new challenges and opportunities AU - Ruusalepp, Raivo AB - Born digital content has always been considered a bigger challenge for preservation than digitised content. Higher volume and technical complexity, dynamism, as well as a complex surrounding rights space are frequently cited as aspects that make born digital content ‘special’ to memory institutions. This paper builds on the Estonian case of introducing digital legal deposit, which has led to an exercise of reconceptualising the digital preservation function of the national library. The rapid increase in volume, file size and new file formats has led to making the library's preservation service levels explicit, an update to the preservation policy and automation of archiving workflows. The new demands on preservation are pushing the current digital repository system of the national library to its limits, and the library needs to embark on migrating to a new preservation solution. This response to a sudden change in digital preservation workload is typical in the heritage sector – upgrading the ingest component is the first instinctive reaction of most memory institutions.
This paper proposes that increasing the throughput of the ingest component needs to be combined with a modular concept of a preservation system that sets interoperability as its core principle. When digital preservation is conceptualised as an exercise in resilience rather than sustainability, the interoperability requirement for systems architecture and service design follows logically. C1 - Wrocław C3 - IFLA WLIC 2017 – Wrocław, Poland – Libraries. Solidarity. Society. in Session 210 - Preservation and Conservation (PAC) Strategic Programme. DA - 2017/// PY - 2017 PB - IFLA L4 - http://library.ifla.org/1677/1/210-ruusalepp-en.pdf KW - digital legal deposit KW - preservation of born digital content KW - resilience ER - TY - JOUR TI - Preserving Meaning, Not Just Objects: Semantics and Digital Preservation AU - Dubin, David AU - Futrelle, Joe AU - Plutchak, Joel AU - Eke, Janet T2 - Library Trends AB - The ECHO DEPository project is a digital preservation research and development project funded by the National Digital Information Infrastructure and Preservation Program (NDIIPP) and administered by the Library of Congress. A key goal of this project is to investigate both practical solutions for supporting digital preservation activities today, and the more fundamental research questions underlying the development of the next generation of digital preservation systems. To support on-the-ground preservation efforts in existing technical and organizational environments, we have developed tools to help curators collect and manage Web-based digital resources, such as the Web Archives Workbench (Kaczmarek et al., 2008), and to enhance existing repositories' support for interoperability and emerging preservation standards, such as the Hub and Spoke Tool Suite (Habing et al., 2008).
In the longer term, however, we recognize that successful digital preservation activities will require a more precise and complete account of the meaning of relationships within and among digital objects. This article describes project efforts to identify the core underlying semantic issues affecting long-term digital preservation, and to model how semantic inference may help next-generation archives head off long-term preservation risks. DA - 2009/// PY - 2009 DO - 10.1353/lib.0.0054 VL - 57 IS - 3 SP - 595 EP - 610 SN - 1559-0682 UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://muse.jhu.edu/content/crossref/journals/library_trends/v057/57.3.dubin.html KW - Web archiving KW - Information science KW - Web archives KW - Digital libraries KW - Digitization of archival materials KW - Digital preservation KW - Digitization KW - Library science KW - Digitization of library materials KW - Archives -- Computer network resources KW - Preservation of materials ER - TY - GEN TI - Descriptive metadata for web archiving: Literature review of user needs AU - OCLC AB - Under the auspices of the OCLC Research Library Partnership Web Archiving Metadata Working Group, this document is a literature review to inform the development of descriptive metadata best practices for archived web content that would meet end-user needs, enhance discovery, and improve metadata consistency. Selected readings include -- at minimum -- a substantive section related to metadata, but most covered a wider swath of issues. This helped the Working Group to learn much else about who the users of web archives are, the strategies they use and the challenges they face.
DA - 2018/// PY - 2018 PB - OCLC Research UR - https://www.oclc.org/research/publications/2018/oclcresearch-descriptive-metadata/recommendations.html Y2 - 2020/08/14/ KW - Web archiving KW - Archives KW - Electronic information resources--Management KW - Library metadata ER - TY - CONF TI - A browser for browsing the past web AU - Jatowt, Adam AU - Kawai, Yukiko AU - Nakamura, Satoshi AU - Kidawara, Yutaka AU - Tanaka, Katsumi AB - We describe a browser for the past web. It can retrieve data from multiple past web resources and features a passive browsing style based on change detection and presentation. The browser shows past pages one by one along a timeline. The parts that were changed between consecutive page versions are animated to reflect their deletion or insertion, thereby drawing the user’s attention to them. The browser enables automatic skipping of changeless periods and filtered browsing based on a user-specified query. C1 - New York, New York, USA C3 - Proceedings of the 15th international conference on World Wide Web - WWW '06 DA - 2006/// PY - 2006 DO - 10.1145/1135777.1135923 SP - 877 PB - ACM Press SN - 1-59593-323-9 UR - http://portal.acm.org/citation.cfm?doid=1135777.1135923 KW - web archives KW - Past web KW - web archive browsing ER - TY - CONF TI - Can we find documents in web archives without knowing their contents? AU - Vo, Khoi Duy AU - Tran, Tuan AU - Nguyen, Tu Ngoc AU - Zhu, Xiaofei AU - Nejdl, Wolfgang AB - Recent advances in preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable for exploring the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and the temporal noise of contents, as well as scalable to the huge amount of data archived.
Despite several attempts at Web archive search, facilitating access to Web archives still remains a challenging problem. In this work, we conduct a first analysis of different ranking strategies that exploit evidence from metadata instead of the full content of documents. We perform a first study to compare the usefulness of non-content evidence for Web archive search, where the evidence is mined from the metadata of file headers, links and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple kinds of evidence to distinguish "good" from "bad" search results. We conduct empirical experiments, quantitative as well as qualitative, to confirm the validity of our proposed method, as a first step towards better ranking in Web archives taking metadata into account. C1 - New York, New York, USA C3 - Proceedings of the 8th ACM Conference on Web Science - WebSci '16 DA - 2016/// PY - 2016 DO - 10.1145/2908131.2908165 SP - 173 EP - 182 PB - ACM Press SN - 978-1-4503-4208-7 UR - http://dl.acm.org/citation.cfm?doid=2908131.2908165 KW - Feature Analysis KW - Temporal Ranking KW - Web Archive Search ER - TY - CONF TI - Bots, Seeds and People: Web Archives As Infrastructure AU - Summers, Ed AU - Punzalan, Ricardo AB - The field of web archiving provides a unique mix of human and automated agents collaborating to achieve the preservation of the web. Centuries-old theories of archival appraisal are being transplanted into the sociotechnical environment of the World Wide Web with varying degrees of success. The work of the archivist and bots in contact with the material of the web presents a distinctive and understudied CSCW-shaped problem. To investigate this space we conducted semi-structured interviews with archivists and technologists who were directly involved in the selection of content from the web for archives.
These semi-structured interviews identified thematic areas that inform the appraisal process in web archives, some of which are encoded in heuristics and algorithms. Making the infrastructure of web archives legible to the archivist, the automated agents and the future researcher is presented as a challenge to the CSCW and archival community. C1 - New York, NY, USA C3 - Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing DA - 2017/11/08/ PY - 2017 DO - 10.1145/2998181.2998345 SP - 821 EP - 834 PB - ACM SN - 978-1-4503-4335-0 UR - http://doi.acm.org/10.1145/2998181.2998345 L4 - http://arxiv.org/abs/1611.02493 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - collaboration KW - archive KW - Computer Science - Digital Libraries KW - design KW - H.3.7 KW - K.4.3 KW - practice KW - web ER - TY - JOUR TI - Detecting off-topic pages within TimeMaps in Web archives. AU - AlNoamany, Yasmin AU - Weigle, Michele C AU - Nelson, Michael L T2 - International Journal on Digital Libraries AB - Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy is a precondition for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of the Web archive captures. In this paper, we address the problems of detecting when a particular page in a Web archive collection has gone off-topic relative to its first archived copy. We do not delete off-topic pages (they remain part of the collection), but they are flagged as off-topic so they can be excluded for consideration for downstream services, such as collection summarization and thumbnail generation. 
We propose different methods (cosine similarity, Jaccard similarity, intersection of the 20 most frequent terms, Web-based kernel function, and the change in size using the number of words and content length) to detect when a page has gone off-topic. Those predicted off-topic pages will be presented to the collection's curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three Archive-It collections to evaluate the proposed methods at different thresholds. We found that combining cosine similarity at threshold 0.10 and change in size using word count at threshold −0.85 performs the best with accuracy = 0.987, F1 score = 0.906, and AUC = 0.968. We evaluated the performance of the proposed method on several Archive-It collections. The average precision of detecting off-topic pages in the collections is 0.89. [ABSTRACT FROM AUTHOR] DA - 2016/09// PY - 2016 DO - 10.1007/s00799-016-0183-5 VL - 17 IS - 3 SP - 203 EP - 221 LA - English SN - 14325012 UR - https://search.proquest.com/docview/1811905000?accountid=27464 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - Web archiving KW - Internet Archive KW - ARCHIVES KW - WEB archives KW - Library And Information Sciences--Computer Applica KW - Archives & records KW - Internet KW - Data mining KW - INFORMATION retrieval KW - Archived collections KW - Document filtering KW - Document similarity KW - Filtering systems KW - HTTP (Computer network protocol) KW - Information retrieval KW - UNIFORM Resource Identifiers KW - Web content mining ER - TY - JOUR TI - Negotiating the Web of the Past AU - Schafer, Valérie AU - Musiani, Francesca AU - Borelli, Marguerite T2 - French Journal for Media Research AB - The material, practical, theoretical elements of Web archiving as an ensemble of practices and a terrain of inquiry are
inextricably entwined. Thus, its processes and infrastructures – often discreet and invisible – are increasingly relevant. Approaches inspired by Science and Technology Studies (STS) can contribute to shed light on the shaping of Web archives. DA - 2016/// PY - 2016 LA - English UR - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L1 - https://hal.archives-ouvertes.fr/hal-01654218/file/schafer_pdf.pdf L4 - http://orbilu.uni.lu/ L4 - http://frenchjournalformediaresearch.com/lodel/index.php?id=963 KW - Web archiving KW - Web archives KW - [ SHS.INFO ] Humanities and Social Sciences/Librar KW - Archivage du Web KW - Archives du Web KW - Born-Digital Heritage KW - gouvernance KW - governance KW - Patrimoine nativement numérique KW - STS ER - TY - JOUR TI - Copyright Challenges of Legal Deposit and Web Archiving in the National Library of Singapore. AU - Pabón Cadavid, Jhonny Antonio T2 - Alexandria AB - This article discusses the development of web archiving in Singapore and its relationship to copyright law. The author describes legal deposit, its definition and historical development, the differences between voluntary and compulsory legal deposit, and the practices of such approaches within the National Library of Singapore. It highlights two main projects, the Singapore Memory Project and Web Archive Singapore (WAS). The paper analyses how the implementation of legal deposit for preserving web material creates a complex relationship between copyright and digital heritage, and describes difficulties that cover the information lifecycle of web archiving. Finally, the paper presents a set of conclusions and recommendations regarding the need for modifying copyright legislation to foster research activities within Singapore's knowledge economy.
[ABSTRACT FROM AUTHOR] DA - 2014/03// PY - 2014 VL - 25 IS - 1/2 SP - 1 EP - 19 LA - English SN - 09557490 UR - https://doi.org/10.7227/ALX.0017 L4 - https://search.proquest.com/docview/1623365662?accountid=27464 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds L4 - http://journals.sagepub.com/doi/pdf/10.7227/ALX.0017 KW - Web archiving KW - web archives KW - Web archives KW - legal deposit KW - Library And Information Sciences KW - Legal deposit of books, etc. KW - copyright KW - Copyright of digital media KW - national libraries KW - National Library (Singapore) KW - Singapore ER - TY - CONF TI - Routing Memento Requests Using Binary Classifiers AU - Bornand, Nicolas J. AU - Balakireva, Lyudmila AU - Van de Sompel, Herbert AB - The Memento protocol provides a uniform approach to query individual web archives. Soon after its emergence, Memento Aggregator infrastructure was introduced that supports querying across multiple archives simultaneously. An Aggregator generates a response by issuing the respective Memento request against each of the distributed archives it covers. As the number of archives grows, it becomes increasingly challenging to deliver aggregate responses while keeping response times and computational costs under control. Ad-hoc heuristic approaches have been introduced to address this challenge and research has been conducted aimed at optimizing query routing based on archive profiles. In this paper, we explore the use of binary, archive-specific classifiers generated on the basis of the content cached by an Aggregator, to determine whether or not to query an archive for a given URI. Our results turn out to be readily applicable and can help to significantly decrease both the number of requests and the overall response times without compromising on recall.
We find, among others, that classifiers can reduce the average number of requests by 77% compared to a brute force approach on all archives, and the overall response time by 42% while maintaining a recall of 0.847. C1 - New York, NY, USA C3 - Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries DA - 2016/// PY - 2016 DO - 10.1145/2910896.2910899 SP - 63 EP - 72 PB - ACM SN - 978-1-4503-4229-2 UR - http://dl.acm.org/citation.cfm?doid=2910896.2910899 L4 - http://doi.acm.org/10.1145/2910896.2910899 KW - web archiving KW - memento KW - machine learning KW - request routing ER - TY - JOUR TI - Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives. AU - Milligan, Ian T2 - International Journal of Humanities & Arts Computing: A Journal of Digital Humanities AB - Contemporary and future historians need to grapple with and confront the challenges posed by web archives. These large collections of material, accessed either through the Internet Archive's Wayback Machine or through other computational methods, represent both a challenge and an opportunity to historians. Through these collections, we have the potential to access the voices of millions of non-elite individuals (recognizing of course the cleavages in both Web access as well as method of access). To put this in perspective, the Old Bailey Online currently describes its monumental holdings of 197,745 trials between 1674 and 1913 as the 'largest body of texts detailing the lives of non-elite people ever published.' GeoCities.com, a platform for everyday web publishing in the mid-to-late 1990s and early 2000s, amounted to over thirty-eight million individual webpages. Historians will have access, in some form, to millions of pages: written by everyday people of various classes, genders, ethnicities, and ages. 
While the Web was not a perfect democracy by any means - it was and is unevenly accessed across each of those categories - this still represents a massive collection of non-elite speech. Yet a figure like thirty-eight million webpages is both a blessing and a curse. We cannot read every website, and must instead rely upon discovery tools to find the information that we need. Yet these tools largely do not exist for web archives, or are in a very early state of development: what will they look like? What information do historians want to access? We cannot simply map over web tools optimized for discovering current information through online searches or metadata analysis. We need to find information that mattered at the time, to diverse and very large communities. Furthermore, web pages cannot be viewed in isolation, outside of the networks that they inhabited. In theory, amongst corpuses of millions of pages, researchers can find whatever they want to confirm. The trick is situating it into a larger social and cultural context: is it representative? Unique? In this paper, 'Lost in the Infinite Archive,' I explore what the future of digital methods for historians will be when they need to explore web archives. Historical research of periods beginning in the mid-1990s will need to use web archives, and right now we are not ready. This article draws on first-hand research with the Internet Archive and Archive-It web archiving teams. It draws upon three exhaustive datasets: the large Web ARChive (WARC) files that make up Wide Web Scrapes of the Web; the metadata-intensive WAT files that provide networked contextual information; and the lifted-straight-from-the-web guerilla archives generated by groups like Archive Team. Through these case studies, we can see - hands-on - what richness and potentials lie in these new cultural records, and what approaches we may need to adopt. It helps underscore the need to have humanists involved at this early, crucial stage. 
[ABSTRACT FROM AUTHOR] DA - 2016/03// PY - 2016 VL - 10 IS - 1 SP - 78 EP - 94 SN - 17538548 UR - https://doi.org/10.3366/ijhac.2016.0161 L4 - http://www.euppublishing.com/doi/abs/10.3366/ijhac.2016.0161 L4 - http://search.ebscohost.com/login.aspx?authtype=ip,cookie,cpid&custid=s6213251&groupid=main&profile=eds KW - WEB archiving KW - RESEARCH KW - WORLD Wide Web KW - ARCHIVES -- Computer network resources KW - WEB archives KW - archive KW - digital history KW - HISTORIANS KW - historical studies KW - WEB archives -- Research KW - webscraping KW - world wide web KW - WORLD Wide Web -- Research ER - TY - JOUR TI - The impact of JavaScript on archivability AU - Brunelle, Justin F. AU - Kelly, Mat AU - Weigle, Michele C. AU - Nelson, Michael L. T2 - International Journal on Digital Libraries AB - Web Archiving Integration Layer (WAIL) is a desktop application written in Python that integrates Heritrix and OpenWayback. In this work we recreate and extend WAIL from the ground up to facilitate collection-based personal Web archiving. Our new iteration of the software, WAIL-Electron, leverages native Web technologies (e.g., JavaScript, Chromium) using Electron to open new potential for Web archiving by individuals in a stand-alone cross-platform native application. By replacing OpenWayback with PyWb, we provide a novel means for personal Web archivists to curate collections of their captures from their own personal computer rather than relying on an external archival Web service. As extended features we also provide the ability for a user to monitor and automatically archive Twitter users’ feeds, even those requiring authentication, as well as provide a reference implementation for integrating a browser-based preservation tool into an OS native application.
DA - 2016/06/25/ PY - 2016 DO - 10.1007/s00799-015-0140-8 VL - 17 IS - 2 SP - 95 EP - 117 SN - 1432-5012 UR - https://doi.org/10.1007/s00799-015-0140-8 L4 - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=115054901&lang=hu&site=ehost-live L4 - http://link.springer.com/10.1007/s00799-015-0140-8 KW - Web archiving KW - Digital preservation KW - Websites KW - ACCESS control KW - Automation KW - Browser-Based Preservation KW - JavaScript (Computer program language) KW - Personal Web Archiving KW - Scripts KW - Twitter (Web resource) KW - Web architecture KW - Web Archive Collections ER - TY - CONF TI - A Time-aware Random Walk Model for Finding Important Documents in Web Archives AU - Nguyen, Tu Ngoc AU - Kanhabua, Nattiya AU - Niederée, Claudia AU - Zhu, Xiaofei AB - Due to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, web archives are emerging as gold-mines for content analytics of many sorts. However, supporting search, which goes beyond navigational search via URLs, is a very challenging task in these unique structures with huge, redundant and noisy temporal content. In this paper, we address the search needs of expert users such as journalists, economists or historians for discovering a topic in time: Given a query, the top-k returned results should give the best representative documents that cover most interesting time-periods for the topic. For this purpose, we propose a novel random walk-based model that integrates relevance, temporal authority, diversity and time in a unified framework. Our preliminary experimental results on the large-scale real-world web archival collection show that our method significantly improves the state-of-the-art algorithms (i.e., PageRank) in ranking temporal web pages.
C1 - New York, New York, USA C3 - Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15 DA - 2015/// PY - 2015 DO - 10.1145/2766462.2767832 SP - 915 EP - 918 PB - ACM Press SN - 978-1-4503-3621-5 UR - http://dl.acm.org/citation.cfm?doid=2766462.2767832 KW - Algorithms KW - Temporal Ranking KW - Authority KW - Diversity KW - Web Archive KW - Experimentation KW - Performance ER - TY - CONF TI - Finding Pages on the Unarchived Web AU - Huurdeman, Hugo C AU - Ben-David, Anat AU - Kamps, Jaap AU - Samar, Thaer AU - de Vries, Arjen P AB - Web archives preserve the fast changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies---most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average. 
C1 - Piscataway, NJ, USA C3 - Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2014/// PY - 2014 SP - 331 EP - 340 PB - IEEE Press SN - 978-1-4799-5569-5 UR - http://dl.acm.org/citation.cfm?id=2740769.2740827 KW - web archiving KW - web archives KW - information retrieval KW - anchor text KW - link evidence KW - web crawlers ER - TY - CONF TI - A Memento Web Browser for iOS AU - Tweedy, Heather AU - McCown, Frank AU - Nelson, Michael L AB - The Memento framework allows web browsers to request and view archived web pages in a transparent fashion. However, Memento is still in the early stages of adoption, and browser-plugins are often required to enable Memento support. We report on a new iOS app called the Memento Browser, a web browser that supports Memento and gives iPhone and iPad users transparent access to the world's largest web archives. C1 - New York, NY, USA C3 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2013/// PY - 2013 DO - 10.1145/2467696.2467764 SP - 371 EP - 372 PB - ACM SN - 978-1-4503-2077-1 UR - http://doi.acm.org/10.1145/2467696.2467764 KW - web archiving KW - memento KW - mobile web KW - web browser ER - TY - CONF TI - EverLast: A Distributed Architecture for Preserving the Web AU - Anand, Avishek AU - Bedathur, Srikanta AU - Berberich, Klaus AU - Schenkel, Ralf AU - Tryfonopoulos, Christos AB - The World Wide Web has become a key source of knowledge pertaining to almost every walk of life. Unfortunately, much of data on the Web is highly ephemeral in nature, with more than 50-80% of content estimated to be changing within a short time. Continuing the pioneering efforts of many national (digital) libraries, organizations such as the International Internet Preservation Consortium (IIPC), the Internet Archive (IA) and the European Archive (EA) have been tirelessly working towards preserving the ever changing Web. 
However, while these web archiving efforts have paid significant attention towards long term preservation of Web data, they have paid little attention to developing a global-scale infrastructure for collecting, archiving, and performing historical analyses on the collected data. Based on insights from our recent work on building text analytics for Web Archives, we propose EverLast, a scalable distributed framework for next generation Web archival and temporal text analytics over the archive. Our system is built on a loosely-coupled distributed architecture that can be deployed over large-scale peer-to-peer networks. In this way, we allow the integration of many archival efforts taken mainly at a national level by national digital libraries. Key features of EverLast include support of time-based text search & analysis and the use of human-assisted archive gathering. In this paper, we outline the overall architecture of EverLast, and present some promising preliminary results. C1 - New York, NY, USA C3 - Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2009/// PY - 2009 DO - 10.1145/1555400.1555455 SP - 331 EP - 340 PB - ACM SN - 978-1-60558-322-8 UR - http://doi.acm.org/10.1145/1555400.1555455 KW - web archives KW - crawling KW - indexing KW - time-travel search ER - TY - CONF TI - Detecting Age of Page Content AU - Jatowt, Adam AU - Kawai, Yukiko AU - Tanaka, Katsumi AB - Web pages often contain objects created at different times. The information about the age of such objects may provide useful context for understanding page content and may serve many potential uses. In this paper, we describe a novel concept for detecting approximate creation dates of content elements in Web pages. Our approach is based on dynamically reconstructing page histories using data extracted from external sources - Web archives - and efficiently searching inside them to detect insertion dates of content elements.
We discuss various issues involving the proposed approach and demonstrate the example of an application that enhances browsing the Web by inserting annotations with temporal metadata into page content on user request. C1 - New York, NY, USA C3 - Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management DA - 2007/// PY - 2007 DO - 10.1145/1316902.1316925 SP - 137 EP - 144 PB - ACM SN - 978-1-59593-829-9 UR - http://doi.acm.org/10.1145/1316902.1316925 KW - metadata KW - web archive KW - age detection KW - document annotation ER - TY - CONF TI - Technical Architecture Overview: Tools for Acquisition, Packaging and Ingest of Web Objects into Multiple Repositories AU - Rani, Shweta AU - Goodkin, Jay AU - Cobb, Judy AU - Habing, Tom AU - Urban, Richard AU - Eke, Janet AU - Pearce-Moses, Richard AB - This poster describes a model for acquiring, packaging and ingesting web objects for archiving in multiple repositories. This ongoing work is part of the ECHO DEPository Project [1], a 3-year NDIIPP-partner digital preservation project at the University of Illinois at Urbana-Champaign with partners OCLC, a consortium of content provider partners, and the Library of Congress. C1 - New York, NY, USA C3 - Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2006/// PY - 2006 DO - 10.1145/1141753.1141855 SP - 360 PB - ACM SN - 1-59593-354-9 UR - http://doi.acm.org/10.1145/1141753.1141855 KW - web archiving KW - digital preservation KW - interoperability architecture ER - TY - CONF TI - Uncovering the Unarchived Web AU - Samar, Thaer AU - Huurdeman, Hugo C AU - Ben-David, Anat AU - Kamps, Jaap AU - de Vries, Arjen AB - Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage.
Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites selected by the archiving institution. In either method, crawling results in more information being harvested than just the websites intended for preservation, which could be used to reconstruct impressions of pages that existed on the live web of the crawl date, but would have been lost forever. We present a method to create representations of what we will refer to as a web collection's 'aura': the web documents that were not included in the archived collection, but are known to have existed --- due to their mentions on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages. C1 - New York, NY, USA C3 - Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval DA - 2014/// PY - 2014 DO - 10.1145/2600428.2609544 SP - 1199 EP - 1202 PB - ACM SN - 978-1-4503-2257-7 UR - http://doi.acm.org/10.1145/2600428.2609544 KW - web archiving KW - web archives KW - information retrieval KW - anchor text KW - web crawlers KW - web graph ER - TY - CONF TI - Archiving the Web Using Page Changes Patterns: A Case Study AU - Ben Saad, Myriam AU - Gançarski, Stéphane AB - A pattern is a model or a template used to summarize and describe the behavior (or the trend) of data having generally some recurrent events. Patterns have received considerable attention in recent years and were widely studied in the data mining field.
Various pattern mining approaches have been proposed and used for different applications such as network monitoring, moving object tracking, financial or medical data analysis, scientific data processing, etc. In these different contexts, discovered patterns were useful to detect anomalies, to predict data behavior (or trend), or more generally, to simplify data processing or to improve system performance. However, to the best of our knowledge, patterns have never been used in the context of web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools to efficiently archive web sites. We first define our pattern model that describes the changes of pages. Then, we present the strategy used to (i) extract the temporal evolution of page changes, to (ii) discover patterns and to (iii) exploit them to improve web archives. We choose the archive of French public TV channels « France Télévisions » as a case study in order to validate our approach. Our experimental evaluation based on real web pages shows the utility of patterns to improve archive quality and to optimize indexing or storing. C1 - New York, NY, USA C3 - Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries DA - 2011/// PY - 2011 DO - 10.1145/1998076.1998098 SP - 113 EP - 122 PB - ACM SN - 978-1-4503-0744-4 UR - http://doi.acm.org/10.1145/1998076.1998098 KW - web archiving KW - pattern KW - web page changes ER - TY - CONF TI - Archiving the relaxed consistency web AU - Xie, Zhiwu AU - de Sompel, Herbert AU - Liu, Jinyang AU - van Reenen, Johann AU - Jordan, Ramiro AB - The historical, cultural, and intellectual importance of archiving the web has been widely recognized. 
Today, all countries with high Internet penetration rate have established high-profile archiving initiatives to crawl and archive the fast-disappearing web content for long-term use. As web technologies evolve, established web archiving techniques face challenges. This paper focuses on the potential impact of the relaxed consistency web design on crawler driven web archiving. Relaxed consistent websites may disseminate, albeit ephemerally, inaccurate and even contradictory information. If captured and preserved in the web archives as historical records, such information will degrade the overall archival quality. To assess the extent of such quality degradation, we build a simplified feed-following application and simulate its operation with synthetic workloads. The results indicate that a non-trivial portion of a relaxed consistency web archive may contain observable inconsistency, and the inconsistency window may extend significantly longer than that observed at the data store. We discuss the nature of such quality degradation and propose a few possible remedies. C1 - New York, NY, USA C3 - Proceedings of the 22nd ACM international conference on Conference on information & knowledge management DA - 2013/// PY - 2013 DO - 10.1145/2505515.2505551 SP - 2119 EP - 2128 PB - ACM SN - 978-1-4503-2263-8 UR - http://doi.acm.org/10.1145/2505515.2505551 KW - web archiving KW - digital preservation KW - consistency KW - social network ER - TY - CONF TI - On Automatically Tagging Web Documents from Examples AU - Woodward, Nicholas Joel AU - Xu, Weijia AU - Norsworthy, Kent AB - An emerging need in information retrieval is to identify a set of documents conforming to an abstract description. This task presents two major challenges to existing methods of document retrieval and classification. First, similarity based on overall content is less effective because there may be great variance in both content and subject of documents produced for similar functions, e.g. 
a presidential speech or a government ministry white paper. Second, the function of the document can be defined based on user interests or the specific data set through a set of existing examples, which cannot be described with standard categories. Additionally, the increasing volume and complexity of document collections demand new scalable computational solutions. We conducted a case study using web-archived data from the Latin American Government Documents Archive (LAGDA) to illustrate these problems and challenges. We propose a new hybrid approach based on Naïve Bayes inference that uses mixed n-gram models obtained from a training set to classify documents in the corpus. The approach has been developed to exploit parallel processing for large-scale data sets. The preliminary work shows promising results with improved accuracy for this type of retrieval problem. C1 - New York, NY, USA C3 - Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval DA - 2012/// PY - 2012 DO - 10.1145/2348283.2348494 SP - 1111 EP - 1112 PB - ACM SN - 978-1-4503-1472-5 UR - http://doi.acm.org/10.1145/2348283.2348494 KW - web archive KW - naïve Bayesian classification ER - TY - CONF TI - Towards Automatic Document Migration: Semantic Preservation of Embedded Queries AU - Triebsees, Thomas AU - Borghoff, Uwe M AB - Archivists and librarians face an ever increasing amount of digital material. Their task is to preserve its authentic content. In the long run, this requires periodic migrations (from one format to another or from one hardware/software platform to another). Document migrations are challenging tasks where tool-support and a high degree of automation are important. A central aspect is that documents are often mutually related and, hence, a document's semantics has to be considered in its whole context. References between documents are usually formulated in graph- or tree-based query languages like URL or XPath.
A typical scenario is web-archiving where websites are stored inside a server infrastructure that can be queried from HTML-files using URLs. Migrating websites will often require link adaptation in order to preserve link consistency. Although automated and "trustworthy" preservation of link consistency is easy to postulate, it is hard to carry out, in particular, if "trustworthy" means "provably working correct". In this paper, we propose a general approach to semantically evaluating and constructing graph queries, which at the same time conform to a regular grammar, appear as part of a document's content, and access a graph structure that is specified using First-Order Predicate Logic (FOPL). In order to do so, we adapt model checking techniques by constructing suitable query automata. We integrate these techniques into our preservation framework [12] and show the feasibility of this approach using an example. We migrate a website to a specific archiving format and demonstrate the automated preservation of link-consistency. The approach shown in this paper mainly contributes to a higher degree of automation in document migration while still maintaining a high degree of "trustworthiness", namely "provable correctness". C1 - New York, NY, USA C3 - Proceedings of the 2007 ACM Symposium on Document Engineering DA - 2007/// PY - 2007 DO - 10.1145/1284420.1284472 SP - 209 EP - 218 PB - ACM SN - 978-1-59593-776-6 UR - http://doi.acm.org/10.1145/1284420.1284472 KW - digital preservation KW - automated document migration KW - link consistency KW - query processing ER - TY - CONF TI - Scraping SERPs for Archival Seeds: It Matters When You Start AU - Nwala, Alexander C AU - Weigle, Michele C AU - Nelson, Michael L AB - Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7.
In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 - 0.54, and a weekly rate of 0.39 - 0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story after one day from the initial appearance on the SERP ranged from 0.34 - 0.44. After a week, the probability of finding the same news stories diminishes rapidly to 0.01 - 0.11. In addition to the reporting of these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. 
Similarly, our findings suggest that due to the difficulty in retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google as time progresses. C1 - New York, NY, USA C3 - Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries DA - 2018/// PY - 2018 DO - 10.1145/3197026.3197056 SP - 263 EP - 272 PB - ACM SN - 978-1-4503-5178-2 UR - http://doi.acm.org/10.1145/3197026.3197056 KW - web archiving KW - collection building KW - crawling KW - discoverability ER - TY - JOUR TI - Time-aware Approaches to Information Retrieval AU - Kanhabua, Nattiya T2 - SIGIR Forum AB - In this thesis, we address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching only can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events that happened at particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query. Our contributions in this thesis are different time-aware approaches within three topics in IR: content analysis, query analysis, and retrieval and ranking models. In particular, we aim at improving the retrieval effectiveness by 1) analyzing the contents of temporal document collections, 2) performing an analysis of temporal queries, and 3) explicitly modeling the time dimension into retrieval and ranking.
Leveraging the time dimension in ranking can improve the retrieval effectiveness if information about the creation or publication time of documents is available. In this thesis, we analyze the contents of documents in order to determine the time of non-timestamped documents using temporal language models. We subsequently employ the temporal language models for determining the time of implicit temporal queries, and the determined time is used for re-ranking search results in order to improve the retrieval effectiveness. We study the effect of terminology changes over time and propose an approach to handling terminology changes using time-based synonyms. In addition, we propose different methods for predicting the effectiveness of temporal queries, so that a particular query enhancement technique can be performed to improve the overall performance. When the time dimension is incorporated into ranking, documents will be ranked according to both textual and temporal similarity. In this case, time uncertainty should also be taken into account. Thus, we propose a ranking model that considers the time uncertainty, and improve ranking by combining multiple features using learning-to-rank techniques. Through extensive evaluation, we show that our proposed time-aware approaches outperform traditional retrieval methods and improve the retrieval effectiveness in searching temporal document collections. Available online at: http://www.idi.ntnu.no/research/doctor_theses/nattiya.pdf. DA - 2012/// PY - 2012 DO - 10.1145/2215676.2215691 VL - 46 IS - 1 SP - 85 SN - 0163-5840 UR - http://doi.acm.org/10.1145/2215676.2215691 ER - TY - CONF TI - Effects of Maximum Flow Algorithm on Identifying Web Community AU - Imafuji, Noriko AU - Kitsuregawa, Masaru AB - In this paper, we describe the effects of using maximum flow algorithm on extracting web community from the web. A web community is a set of web pages having a common topic. 
Since the web can be recognized as a graph whose nodes and edges represent web pages and hyperlinks respectively, various graph-theoretical approaches have been proposed to extract web communities from the web graph. The method of finding a web community using the maximum flow algorithm was proposed by NEC Research Institute in Princeton two years ago. However, the properties of web communities derived by this method have seldom been examined. To examine the effects of this method, we selected 30 topics randomly and experimented using Japanese web archives crawled in 2000. Through these experiments, it became clear that the method has both advantages and disadvantages. We will describe some strategies to use this method effectively. Moreover, by using the same topics, we examined another method that is based on complete bipartite graphs. We compared the web communities obtained by those methods and analyzed their characteristics. C1 - New York, NY, USA C3 - Proceedings of the 4th International Workshop on Web Information and Data Management DA - 2002/// PY - 2002 DO - 10.1145/584931.584941 SP - 43 EP - 48 PB - ACM SN - 1-58113-593-9 UR - http://doi.acm.org/10.1145/584931.584941 KW - web graph KW - web community KW - maximum-flow algorithm ER - TY - CONF TI - Temporal Analog Retrieval Using Transformation over Dual Hierarchical Structures AU - Zhang, Yating AU - Jatowt, Adam AU - Tanaka, Katsumi AB - In recent years, we have witnessed a rapid increase of text content stored in digital archives such as newspaper archives or web archives. Many old documents have been converted to digital form and made accessible online. Due to the passage of time, it is however difficult to effectively perform search within such collections. Users, especially younger ones, may have problems in finding appropriate keywords to perform effective search due to the terminology gap arising between their knowledge and the unfamiliar domain of archival collections.
In this paper, we provide a general framework to bridge different domains across time and, by this, to facilitate search and comparison as if carried out in the user's familiar domain (i.e., the present). In particular, we propose to find analogical terms across temporal text collections by applying a series of transformation procedures. We develop a cluster-biased transformation technique which makes use of hierarchical cluster structures built on the temporally distributed document collections. Our methods do not need any specially prepared training data and can be applied to diverse collections and time periods. We test the performance of the proposed approaches on the collections separated by both short (e.g., 20 years) and long time gaps (70 years), and we report improvements in the range of 18%-27% over short and 56%-92% over long periods when compared to state-of-the-art baselines. C1 - New York, NY, USA C3 - Proceedings of the 2017 ACM on Conference on Information and Knowledge Management DA - 2017/// PY - 2017 DO - 10.1145/3132847.3132917 SP - 717 EP - 726 PB - ACM SN - 978-1-4503-4918-5 UR - http://doi.acm.org/10.1145/3132847.3132917 KW - cluster-biased KW - dual hierarchical structure KW - heterogeneous document collections KW - temporal analog ER - TY - CONF TI - Durable Top-k Search in Document Archives AU - U, Leong Hou AU - Mamoulis, Nikos AU - Berberich, Klaus AU - Bedathur, Srikanta AB - We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects which have different valid instances along a history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval.
Such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from the classic top-k rank aggregation algorithm NRA. The second technique is based on a shared execution paradigm and is more efficient than the first approach. In addition, we propose a special indexing technique for archived data. The index, coupled with a space partitioning technique, improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions. C1 - New York, NY, USA C3 - Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data DA - 2010/// PY - 2010 DO - 10.1145/1807167.1807228 SP - 555 EP - 566 PB - ACM SN - 978-1-4503-0032-2 UR - http://doi.acm.org/10.1145/1807167.1807228 KW - temporal queries KW - document archives KW - top-k search ER - TY - CONF TI - How to Choose a Digital Preservation Strategy: Evaluating a Preservation Planning Procedure AU - Strodl, Stephan AU - Becker, Christoph AU - Neumayer, Robert AU - Rauber, Andreas AB - An increasing number of institutions throughout the world face legal obligations or business needs to collect and preserve digital objects over several decades. A range of tools exists today to support the variety of preservation strategies such as migration or emulation. Yet, different preservation requirements across institutions and settings make the decision on which solution to implement very difficult. This paper presents the PLANETS Preservation Planning approach. It provides an approved way to make informed and accountable decisions on which solution to implement in order to optimally preserve digital objects for a given purpose. It is based on Utility Analysis to evaluate the performance of various solutions against well-defined requirements and goals.
The viability of this approach is shown in a range of case studies for different settings. We present its application to two scenarios of web archives, two collections of electronic publications, and a collection of multimedia art. This work focuses on the different requirements and goals in the various preservation settings. C1 - New York, NY, USA C3 - Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries DA - 2007/// PY - 2007 DO - 10.1145/1255175.1255181 SP - 29 EP - 38 PB - ACM SN - 978-1-59593-644-8 UR - http://doi.acm.org/10.1145/1255175.1255181 KW - digital preservation KW - evaluation KW - digital libraries KW - OAIS model KW - preservation planning KW - utility analysis ER - TY - CONF TI - Designing Efficient Sampling Techniques to Detect Webpage Updates AU - Tan, Qingzhao AU - Zhuang, Ziming AU - Mitra, Prasenjit AU - Giles, C Lee AB - Due to resource constraints, Web archiving systems and search engines usually have difficulties keeping the entire local repository synchronized with the Web. We advance the state of the art of sampling-based synchronization techniques by answering a challenging question: Given a sampled webpage and its change status, which other webpages are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and the popularity of the webpages. We run extensive experiments on a large dataset of approximately 300,000 webpages to demonstrate that it is most likely to find more updated webpages in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in terms of detecting important changes.
C1 - New York, NY, USA C3 - Proceedings of the 16th International Conference on World Wide Web DA - 2007/// PY - 2007 DO - 10.1145/1242572.1242738 SP - 1147 EP - 1148 PB - ACM SN - 978-1-59593-654-7 UR - http://doi.acm.org/10.1145/1242572.1242738 KW - web crawler KW - sampling KW - search engine ER - TY - CONF TI - Information Evolution in Wikipedia AU - Ceroni, Andrea AU - Georgescu, Mihai AU - Gadiraju, Ujwal AU - Naini, Kaweh Djafari AU - Fisichella, Marco AB - The Web of data is constantly evolving based on the dynamics of its content. Current Web search engine technologies consider static collections and do not factor in explicitly or implicitly available temporal information, that can be leveraged to gain insights into the dynamics of the data. In this paper, we hypothesize that by employing the temporal aspect as the primary means for capturing the evolution of entities, it is possible to provide entity-based accessibility to Web archives. We empirically show that the edit activity on Wikipedia can be exploited to provide evidence of the evolution of Wikipedia pages over time, both in terms of their content and in terms of their temporally defined relationships, classified in literature as events. Finally, we present results from our extensive analysis of a dataset consisting of 31,998 Wikipedia pages describing politicians, and observations from in-depth case studies. Our findings reflect the usefulness of leveraging temporal information in order to study the evolution of entities and breed promising grounds for further research. 
C1 - New York, NY, USA C3 - Proceedings of The International Symposium on Open Collaboration DA - 2014/// PY - 2014 DO - 10.1145/2641580.2641612 SP - 24:1 EP - 24:10 PB - ACM SN - 978-1-4503-3016-9 UR - http://doi.acm.org/10.1145/2641580.2641612 KW - Wikipedia KW - Entity Evolution KW - Events KW - Temporal Information ER - TY - CONF TI - Efficient Temporal Keyword Search over Versioned Text AU - Anand, Avishek AU - Bedathur, Srikanta AU - Berberich, Klaus AU - Schenkel, Ralf AB - Modern text analytics applications operate on large volumes of temporal text data such as Web archives, newspaper archives, blogs, wikis, and micro-blogs. In these settings, searching and mining needs to use constraints on the time dimension in addition to keyword constraints. A natural approach to address such queries is using an inverted index whose entries are enriched with valid-time intervals. It has been shown that these indexes have to be partitioned along time in order to achieve efficiency. However, when the temporal predicate corresponds to a long time range, requiring the processing of multiple partitions, naive query processing incurs high cost of reading of redundant entries across partitions. We present a framework for efficient approximate processing of keyword queries over a temporally partitioned inverted index which minimizes this overhead, thus speeding up query processing. By using a small synopsis for each partition we identify partitions that maximize the number of final non-redundant results, and schedule them for processing early on. Our approach aims to balance the estimated gains in the final result recall against the cost of index reading required. We present practical algorithms for the resulting optimization problem of index partition selection. Our experiments with three diverse, large-scale text archives reveal that our proposed approach can provide close to 80% result recall even when only about half the index is allowed to be read. 
C1 - New York, NY, USA C3 - Proceedings of the 19th ACM International Conference on Information and Knowledge Management DA - 2010/// PY - 2010 DO - 10.1145/1871437.1871528 SP - 699 EP - 708 PB - ACM SN - 978-1-4503-0099-5 UR - http://doi.acm.org/10.1145/1871437.1871528 KW - partition selection KW - partitioned inverted index KW - synopses KW - time-travel search ER - TY - CONF TI - Local Methods for Estimating Pagerank Values AU - Chen, Yen-Yu AU - Gan, Qingqing AU - Suel, Torsten AB - The Google search engine uses a method called PageRank, together with term-based and other ranking techniques, to order search results returned to the user. PageRank uses link analysis to assign a global importance score to each web page. The PageRank scores of all the pages are usually determined off-line in a large-scale computation on the entire hyperlink graph of the web, and several recent studies have focused on improving the efficiency of this computation, which may require multiple hours on a workstation. However, in some scenarios, such as online analysis of link evolution and mining of large web archives such as the Internet Archive, it may be desirable to quickly approximate or update the PageRanks of individual nodes without performing a large-scale computation on the entire graph. We address this problem by studying several methods for efficiently estimating the PageRank score of a particular web page using only a small subgraph of the entire web. In our model, we assume that the graph is accessible remotely via a link database (such as the AltaVista Connectivity Server) or is stored in a relational database that performs lookups on disks to retrieve node and connectivity information. We show that a reasonable estimate of the PageRank value of a node is possible in most cases by retrieving only a moderate number of nodes in the local neighborhood of the node.
C1 - New York, NY, USA C3 - Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management DA - 2004/// PY - 2004 DO - 10.1145/1031171.1031248 SP - 381 EP - 389 PB - ACM SN - 1-58113-874-1 UR - http://doi.acm.org/10.1145/1031171.1031248 KW - search engines KW - pagerank KW - external memory algorithms KW - link database KW - link-based ranking KW - out-of-core ER - TY - CONF TI - Named Entity Evolution Analysis on Wikipedia AU - Holzmann, Helge AU - Risse, Thomas AB - Accessing Web archives raises a number of issues caused by their temporal characteristics. Additional knowledge is needed to find and understand older texts. Especially entities mentioned in texts are subject to change. Most severe in terms of information retrieval are name changes. In order to find entities that have changed their name over time, search engines need to be aware of this evolution. We tackle this problem by analyzing Wikipedia in terms of entity evolutions mentioned in articles. We present statistical data on excerpts covering name changes, which will be used to discover similar text passages and extract evolution knowledge in future work. C1 - New York, NY, USA C3 - Proceedings of the 2014 ACM Conference on Web Science DA - 2014/// PY - 2014 DO - 10.1145/2615569.2615639 SP - 241 EP - 242 PB - ACM SN - 978-1-4503-2622-3 UR - http://doi.acm.org/10.1145/2615569.2615639 KW - named entity evolution KW - semantics KW - wikipedia ER - TY - CONF TI - Archiving Web Site Resources: A Records Management View AU - Pennock, Maureen AU - Kelly, Brian AB - In this paper, we propose the use of records management principles to identify and manage Web site resources with enduring value as records. Current Web archiving activities, collaborative or organisational, whilst extremely valuable in their own right, often do not and cannot incorporate requirements for proper records management. 
Material collected under such initiatives therefore may not be reliable or authentic from a legal or archival perspective, with insufficient metadata collected about the object during its active life, and valuable materials destroyed whilst ephemeral items are maintained. Education, training, and collaboration between stakeholders are integral to avoiding these risks and successfully preserving valuable Web-based materials. C1 - New York, NY, USA C3 - Proceedings of the 15th International Conference on World Wide Web DA - 2006/// PY - 2006 DO - 10.1145/1135777.1135978 SP - 987 EP - 988 PB - ACM SN - 1-59593-323-9 UR - http://doi.acm.org/10.1145/1135777.1135978 KW - best practices KW - archiving web sites KW - records management ER - TY - JOUR TI - Web Curator Tool AU - Beresford, Philip T2 - Ariadne DA - 2007/// PY - 2007 IS - 50 SN - 1361-3200 UR - http://www.ariadne.ac.uk/issue/50/beresford/ ER - TY - RPRT TI - Részletes beszámoló a 2017-2018 évi webarchiválási pilot projektről AU - OSZK Webarchiválási munkacsoport CY - Budapest, Hungary DA - 2018/// PY - 2018 PB - Országos Széchényi Könyvtár ER - TY - RPRT TI - Az OSZK webaratás pilot projektjének gyűjtőköri tervezete AU - OSZK Webarchiválási munkacsoport DA - 2017/// PY - 2017 PB - Országos Széchényi Könyvtár L1 - https://webarchivum.oszk.hu/wp-content/uploads/2020/02/Webarchivum_gyujtokor_tervezet.pdf ER - TY - ELEC TI - UK Government Web Archive AU - National Archives DA - 1996/// PY - 1996 ER - TY - ELEC TI - Webarchiv AU - Kvasnica, Jaroslav DA - 2016/// PY - 2016 ER - TY - CHAP TI - Historical Web as a Tool for Analyzing Social Change BT - Second International Handbook of Internet Research AU - Schroeder, Ralph AU - Brügger, Niels AU - Cowls, Josh T2 - Second International Handbook of Internet Research A2 - Hunsinger, Jeremy A2 - Allen, Matthew M A2 - Klastrup, Lisbeth AB - This chapter discusses how the World Wide Web can be used as a resource for historians and social scientists. 
The web has existed for more than two decades and been used for many purposes, including as a source of information, entertainment, and much else. It has become an indispensable part of our daily lives. Future historians and social scientists are therefore bound to look to the web, its content, and structure, to understand how society was changing – just as they have used various records such as letters, novels, newspapers, radio, television, and other artifacts as a record of the past for the pre-digital era. This chapter explores how scholars can make use of the archived web as a source for understanding historical patterns of culture and society, including the challenges they face in doing so. CY - Dordrecht DA - 2020/// PY - 2020 SP - 489 EP - 504 PB - Springer Netherlands SN - 978-94-024-1555-1 UR - https://doi.org/10.1007/978-94-024-1555-1_24 ER - TY - CHAP TI - Mirkwood: An Online Parallel Crawler BT - International Joint Conference: 12th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2019) and 10th International Conference on EUropean Transnational Education (I AU - García, Juan F AU - Carriegos, Miguel V T2 - International Joint Conference: 12th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2019) and 10th International Conference on EUropean Transnational Education (ICEUTE 2019). CISIS 2019, ICEUTE 2019. Adva A2 - Martínez Álvarez, Francisco A2 - Troncoso Lora, Alicia A2 - Sáez Muñoz, José António A2 - Quintián, Héctor A2 - Corchado, Emilio AB - In this research we present Mirkwood, a parallel crawler for fast and online syntactic analysis of websites. Configured by default to behave as a focused crawler, analysing exclusively a limited set of hosts, it includes seed extraction capabilities, which allows it to autonomously obtain high quality sites to crawl. 
Mirkwood is designed to run in a computer cluster, taking advantage of all the cores of its individual machines (virtual or physical), although it can also run on a single machine. By analysing sites online and not downloading the web content, we achieve crawling speeds several orders of magnitude faster than if we did, while assuring that the content we check is up to date. Our crawler relies on MPI, for the cluster of computers, and threading, for each individual machine of the cluster. Our software has been tested on several platforms, including the Supercomputer Calendula. Mirkwood is entirely written in Java, making it multi-platform and portable. CY - Cham DA - 2020/// PY - 2020 SP - 47 EP - 56 PB - Springer International Publishing SN - 978-3-030-20005-3 KW - computation KW - Crawler Parallel KW - High performance computing ER - TY - CHAP TI - Memory of the World, Documentary Heritage and Digital Technology: Critical Perspectives BT - The UNESCO Memory of the World Programme: Key Aspects and Recent Developments AU - Prodan, Anca Claudia T2 - The UNESCO Memory of the World Programme. Heritage Studies. A2 - Edmondson, Ray A2 - Jordan, Lothar A2 - Prodan, Anca Claudia AB - This chapter explores the potential that critically oriented perspectives hold for broadened insights about the heritage value of digital documents. Digital technology has significantly changed the way documents are conceptualized, created, accessed, transmitted and preserved, and digital documents are characterized by features that challenge established perspectives. Although any of these features may hold heritage significance, digital documentary heritage is poorly represented in the context of the UNESCO Memory of the World Programme (MoW), in particular on the International Memory of the World Register, which contains a selection of some of the most globally representative documents in any form, including the digital.
Observing that libraries and archives, and their underlying disciplines, which have informed MoW, have been dominated by positivism, this chapter builds on the assumption that approaching documents too narrowly entails the risk of overlooking the manifold significance they could have. Consequently, I suggest that moving away from positivism and adopting critical perspectives might help us understand more comprehensively the manifold heritage significance of digital documents. For illustration, I am using the example of software, and I discuss how the adoption of critical perspectives enables broadened insights about the significance of software, not just as a component in a digital document but also as a document in its own right. CY - Cham DA - 2020/// PY - 2020 SP - 159 EP - 174 PB - Springer International Publishing SN - 978-3-030-18441-4 UR - https://doi.org/10.1007/978-3-030-18441-4_11 KW - Critical code studies KW - Critical perspectives KW - Definitions KW - Digital documentary heritage KW - Software heritage KW - Software studies ER - TY - JOUR TI - Web archives as a data resource for digital scholars AU - Vlassenroot, Eveline AU - Chambers, Sally AU - Di Pretoro, Emmanuel AU - Geeraert, Friedel AU - Haesendonck, Gerald AU - Michel, Alejandra AU - Mechant, Peter T2 - International Journal of Digital Humanities AB - The aim of this article is to provide an exploratory analysis of the landscape of web archiving activities in Europe. Our contribution, based on desk research, and complemented with data from interviews with representatives of European heritage institutions, provides a descriptive overview of the state-of-the-art of national web archiving in Europe. It is written for a broad interdisciplinary audience, including cultural heritage professionals, IT specialists and managers, and humanities and social science researchers. 
The legal, technical and operational aspects of web archiving and the value of web archives as born-digital primary research resources are both explored. In addition to investigating the organisations involved and the scope of their web archiving programmes, the curatorial aspects of the web archiving process, such as selection of web content, the tools used and the provision of access and discovery services are also considered. Furthermore, general policies related to web archiving programmes are analysed. The article concludes by offering four important issues that digital scholars should consider when using web archives as a historical data source. Whilst recognising that this study was limited to a sample of only nine web archives, this article can nevertheless offer some useful insights into the technical, legal, curatorial and policy-related aspects of web archiving. Finally, this paper could function as a stepping stone for more extensive and qualitative research. DA - 2019/04/08/ PY - 2019 DO - 10.1007/s42803-019-00007-7 VL - 1 IS - 1 SP - 85 EP - 111 SN - 2524-7832 UR - http://link.springer.com/10.1007/s42803-019-00007-7 L1 - https://link.springer.com/content/pdf/10.1007%252Fs42803-019-00007-7.pdf KW - Web archives KW - Copyright Technology for web archiving KW - Curation of digital collections KW - Digital scholarship ER - TY - JOUR TI - Estimating PageRank deviations in crawled graphs AU - Holzmann, Helge AU - Anand, Avishek AU - Khosla, Megha T2 - Applied Network Science AB - Most real-world graphs collected from the Web like Web graphs and social network graphs are partially discovered or crawled. This leads to inaccurate estimates of graph properties based on link analysis such as PageRank. In this paper we focus on studying such deviations in ordering/ranking imposed by PageRank over crawled graphs. We first show that deviations in rankings induced by PageRank are indeed possible. 
We measure how much a ranking, induced by PageRank, on an input graph could deviate from the original unseen graph. More importantly, we are interested in conceiving a measure that approximates the rank correlation among them without any knowledge of the original graph. To this end we formulate the HAK measure that is based on computing the impact redistribution of PageRank according to the local graph structure. We further propose an algorithm that identifies connected subgraphs over the input graph for which the relative ordering is preserved. Finally, we perform extensive experiments on both real-world Web and social network graphs with more than 100M vertices and 10B edges as well as synthetic graphs to showcase the utility of HAK and our High-fidelity Component Selection approach. DA - 2019/// PY - 2019 DO - 10.1007/s41109-019-0201-9 VL - 4 IS - 1 SP - 86 SN - 2364-8228 UR - https://doi.org/10.1007/s41109-019-0201-9 KW - Crawls KW - PageRank KW - Ranking deviations ER - TY - JOUR TI - The invention and dissemination of the spacer gif: implications for the future of access and use of web archives AU - Owens, Trevor AU - Thomas, Grace Helen T2 - International Journal of Digital Humanities AB - Over the last two decades publishing and distributing content on the Web has become a core part of society. This ephemeral content has rapidly become an essential component of the human record. Writing histories of the late 20th and early 21st century will require engaging with web archives. The scale of web content and of web archives presents significant challenges for how research can access and engage with this material. Digital humanities scholars are advancing computational methods to work with corpora of millions of digitized resources, but to fully engage with the growing content of two decades of web archives, we now require methods to approach and examine billions, ultimately trillions, of incongruous resources.
This article approaches one seemingly insignificant, but fundamental, aspect in web design history: the use of tiny transparent images as a tool for layout design, and surfaces how traces of these files can illustrate future paths for engaging with web archives. This case study offers implications for future methods allowing scholars to engage with web archives. It also prompts considerations for librarians and archivists in thinking about web archives as data and the development of systems, qualitative and quantitative, through which to make this material available. DA - 2019/// PY - 2019 DO - 10.1007/s42803-019-00006-8 VL - 1 IS - 1 SP - 71 EP - 84 SN - 2524-7840 UR - https://doi.org/10.1007/s42803-019-00006-8 KW - Computational scholarship KW - Cryptographic hash KW - Digital history KW - Web archiving ER - TY - JOUR TI - Born-digital archives AU - Ries, Thorsten AU - Palkó, Gábor T2 - International Journal of Digital Humanities DA - 2019/// PY - 2019 DO - 10.1007/s42803-019-00011-x VL - 1 IS - 1 SP - 1 EP - 11 SN - 2524-7840 UR - https://doi.org/10.1007/s42803-019-00011-x ER - TY - CHAP TI - Behind the Scenes of Web Archiving: Metadata of Harvested Websites AU - Di Pretoro, Emmanuel AU - Geeraert, Friedel T2 - Press, Trust and Understanding: the value of metadata in a digitally joined-up world A2 - Depoortere, R. A2 - Gheldof, T. A2 - Styven, D. A2 - Van, J.Eycken, Der CY - Brussels DA - 2019/// PY - 2019 SP - 63 EP - 74 PB - Archives et Bibliothèques de Belgique - Archief- en Bibliotheekwezen in België UR - https://hal.archives-ouvertes.fr/hal-02124714 ER - TY - JOUR TI - Why Archive the Web? AU - Pittman, Bess T2 - Online Searcher AB - The article discusses the importance for organizations to have a web archiving program to avoid data losses. Topics include the use of web crawlers for archiving, how to select the most suitable crawler, and third-party companies that offer crawling, hosting, and rendering service for web archives. 
It also offers information on open source crawlers and several successful web archiving projects. DA - 2018/11// PY - 2018 VL - 42 IS - 6 SP - 63 EP - 66 SN - 23249684 KW - ACCESS to information KW - ARCHIVES KW - COMPUTER software KW - INTERNET KW - SOFTWARE architecture KW - WORLD Wide Web ER - TY - JOUR TI - Web Archive AU - Finnemann, Niels Ole T2 - KNOWLEDGE ORGANIZATION AB - This article deals with the function of general web archives within the emerging organization of fastgrowing digital knowledge resources. It opens with a brief overview of reasons why general web archives are needed. Sections two and three present major, long term web archive initiatives and discuss the purposes and possible functions and unknown future needs, demands and concerns. Section four analyses three main principles for the selection of materials to be preserved in contemporary web archiving strategies, topic-centric, domain-centric and time-centric archiving strategies and how to combine these to provide a broad and rich archive. Section five is concerned with inherent limitations and why web archives are always flawed. The last section deals with the question whether and how web archives may be considered a new type of knowledge organization system (KOS) necessary to preserve web materials, to allow for the development of a range of new methodologies, to analyze these particular corpora in long term and long tail perspectives, and to build a bridge towards the rapidly expanding but fragmented landscape of digital archives, libraries, research infrastructures and other sorts of digital repositories. 
[ABSTRACT FROM AUTHOR] DA - 2019/01// PY - 2019 DO - 10.5771/0943-7444-2019-1-47 VL - 46 IS - 1 SP - 47 EP - 70 SN - 0943-7444 UR - https://doi.org/10.5771/0943-7444-2019-1-47 L4 - https://www.nomos-elibrary.de/index.php?doi=10.5771/0943-7444-2019-1-47 KW - Web archiving KW - archives KW - Corpora (Linguistics) KW - digital KW - Information science KW - Knowledge management KW - knowledge organization KW - materials KW - Web archives ER - TY - CHAP TI - A Framework for Web Archiving and Guaranteed Retrieval AU - Devendran, A AU - Arunkumar, K T2 - Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1016 A2 - Sharma, Neha A2 - Chakrabarti, Amlan A2 - Balas, Valentina Emilia AB - As of today, ‘web.archive.org’ has archived more than 338 billion web pages. How many of those pages are 100% retrievable? How many pages were left out or ignored simply because of a compatibility issue? How many of them were in vernacular languages and encoded in different formats (before UNICODE was standardized)? And that is only the content type text; consider the other MIME types, which were encoded and decoded with different algorithms. The fundamental reason for this lies in the representation of digital data itself: we all know that a sequence of 0s and 1s does not make proper sense unless it is decoded properly. The browsers that could have rendered the content properly at the time of archiving may since have become obsolete, or the browser platforms may have been upgraded beyond recognizing old formats. We studied various works related to data preservation and web archiving and propose a new framework that stores the exact client browser details (user-agent) in the WARC record and uses them to load the corresponding browser at the client side and render the archived content.
CY - Singapore DA - 2020/// PY - 2020 SP - 205 EP - 215 PB - Springer Singapore SN - 978-981-13-9364-8 UR - http://link.springer.com/10.1007/978-981-13-9364-8_16 KW - Web archiving KW - Guaranteed retrieval KW - Personal data ER - TY - JOUR TI - Building Companionship Between Community and Personal Archiving: Strengthening Personal Digital Archiving Support in Community-Based Mobile Digitization Projects AU - Han, Ruohua T2 - Preservation, Digital Technology & Culture AB - The interconnectedness between personal digital archiving (PDA) and community-based digital archiving provides an entry point for thinking about how to better bridge the two within single projects. Flexibility and sustainability are dimensions that warrant special consideration to support PDA within community-based digital archiving projects. This paper examines the flexibility and sustainability of two community-based mobile digitization projects (Culture in Transit and Georgia HomePLACE DigiKits) in supporting PDA. The assessment shows that the projects are in a good position to support PDA, with only some concerns about ensuring sustainable access to digitization equipment and sufficient guidance in long-term preservation. Drawing from this work, I propose three ways community-based mobile digitization projects can be redesigned to further strengthen their support for PDA without undermining their community-based objectives. The goal of this paper is to demonstrate the value of considering connections and differences between community and personal archiving needs in current and future projects, and to call for further coordination and collaboration between community and personal archiving.
DA - 2019/03/26/ PY - 2019 DO - 10.1515/pdtc-2018-0014 VL - 48 IS - 1 SP - 6 EP - 16 SN - 2195-2965 UR - https://doi.org/10.1515/pdtc-2018-0014 L4 - http://www.degruyter.com/view/j/pdtc.2019.48.issue-1/pdtc-2018-0014/pdtc-2018-0014.xml KW - Collaboration KW - Community archiving KW - COMMUNITY involvement KW - DIGITAL preservation KW - DIGITIZATION of archival materials KW - Mobile digitization units KW - Personal digital archiving (PDA) KW - TECHNOLOGICAL innovations KW - WEB archiving ER - TY - JOUR TI - Accessing Web Archives: Integrating an Archive-It Collection into EBSCO Discovery Service AU - Beis, Christina A. AU - Harris, Kayla Nicole AU - Shreffler, Stephanie L. T2 - Journal of Web Librarianship AB - Effective collaboration between archives and technical services can increase the discoverability of special collection materials. Archivists at the University of Dayton Libraries began using Archive-It to capture websites relevant to their collecting policies in 2015. However, the collections were only made available to users from the University of Dayton page on the Archive-It website. Content was isolated in a separate platform and was not promoted to users. Working together, the team of archivists and technical services librarians incorporated the web archive collections into the Libraries' EBSCO Discovery Service (EDS) discovery layer. A local data dictionary was created based on OCLC's Descriptive Metadata for Web Archiving report (2018), and metadata was added at the seed and collection levels. The result was indexed content on a single, user-friendly platform. The web archive collections were then marketed to the University of Dayton community, and statistics were generated on their use.
[ABSTRACT FROM AUTHOR] DA - 2019/07/03/ PY - 2019 DO - 10.1080/19322909.2019.1625844 VL - 13 IS - 3 SP - 246 EP - 259 SN - 1932-2909 UR - https://www.tandfonline.com/doi/full/10.1080/19322909.2019.1625844 KW - web archiving KW - academic libraries KW - Archive-It KW - collaboration KW - discovery layer KW - EBSCO Discovery Service KW - metadata KW - social media KW - special collections KW - web-scale discovery ER - TY - JOUR TI - Plan U: Universal access to scientific and medical research via funder preprint mandates AU - Sever, Richard AU - Eisen, Michael AU - Inglis, John T2 - PLOS Biology AB - Preprint servers such as arXiv and bioRxiv represent a highly successful and relatively low cost mechanism for providing free access to research findings. By decoupling the dissemination of manuscripts from the much slower process of evaluation and certification by journals, preprints also significantly accelerate the pace of research itself by allowing other researchers to begin building on new results immediately. If all funding agencies were to mandate posting of preprints by grantees—an approach we term Plan U (for “universal”)—free access to the world’s scientific output for everyone would be achieved with minimal effort. Moreover, the existence of all articles as preprints would create a fertile environment for experimentation with new peer review and research evaluation initiatives, which would benefit from a reduced barrier to entry because hosting and archiving costs were already covered.
[ABSTRACT FROM AUTHOR] DA - 2019/06/04/ PY - 2019 DO - 10.1371/journal.pbio.3000273 VL - 17 IS - 6 SP - e3000273 SN - 1545-7885 UR - https://doi.org/10.1371/journal.pbio.3000273 L4 - http://dx.plos.org/10.1371/journal.pbio.3000273 KW - WEB archiving KW - Economics KW - Health care KW - Health economics KW - MEDICAL research KW - Medicine and health sciences KW - OPEN access publishing KW - Peer review KW - Perspective KW - RESEARCH KW - Research and analysis methods KW - Research assessment KW - Research funding KW - Research grants KW - Research quality assessment KW - Science policy KW - Scientific publishing KW - Social sciences KW - WEB hosting KW - WEB servers ER - TY - JOUR TI - Experimenting with computational methods for large-scale studies of tracking technologies in web archives AU - Nielsen, Janne T2 - Internet Histories DA - 2019/10/02/ PY - 2019 DO - 10.1080/24701475.2019.1671074 VL - 3 IS - 3-4 SP - 293 EP - 315 SN - 2470-1475 UR - https://www.tandfonline.com/doi/full/10.1080/24701475.2019.1671074 KW - big data KW - computational methods KW - historiography KW - Web history KW - web tracking ER - TY - JOUR TI - 2014 not found: a cross-platform approach to retrospective web archiving AU - Ben-David, Anat T2 - Internet Histories DA - 2019/10/02/ PY - 2019 DO - 10.1080/24701475.2019.1654290 VL - 3 IS - 3-4 SP - 316 EP - 342 SN - 2470-1475 UR - https://www.tandfonline.com/doi/full/10.1080/24701475.2019.1654290 KW - Google KW - Internet Archive KW - Twitter KW - War in Gaza KW - web archiving KW - Wikipedia KW - YouTube ER - TY - JOUR TI - Researching public library programs through Facebook events: a new research approach AU - Mathiasson, Mia Høj AU - Jochumsen, Henrik T2 - Journal of Documentation DA - 2019/07/08/ PY - 2019 DO - 10.1108/JD-08-2018-0137 VL - 75 IS - 4 SP - 857 EP - 875 SN - 0022-0418 UR - https://www.emerald.com/insight/content/doi/10.1108/JD-08-2018-0137/full/html KW - Web archiving KW - Facebook events KW - Grounded theory KW - Public
libraries KW - Public library programmes KW - Research methods KW - Social media content ER - TY - JOUR TI - Personal and Community Connection: Introduction to the Special Issue on Personal Digital Archiving 2018 AU - Condron, Melody T2 - Preservation, Digital Technology & Culture AB - On April 23–25, 2018, the University of Houston hosted the annual Personal Digital Archiving (PDA) Conference in Houston, Texas. The conference is a focused, single-track event that brings together information professionals, students, and non-academics. Though small, the conference commonly attracts attendees from around the world to discuss topics focused on the intersection of personal archiving and technology. The three-day event comprised two full days of presentations to all attendees. Over 140 attendees from five countries were in attendance. Two keynotes, nineteen sessions with question and answer panels, seven posters, and six lightning talks were presented. A third day offered two hands-on workshops and a tour of the Houston Metropolitan Research Center. In this introduction, the Chair of the Conference Planning Committee and Guest Editor of this issue, Melody Condron, discusses highlights of the conference, as well as themes and discussion that tie into the papers presented in this issue.
DA - 2019/03/26/ PY - 2019 DO - 10.1515/pdtc-2019-0002 VL - 48 IS - 1 SP - 1 EP - 2 SN - 2195-2965 UR - https://doi.org/10.1515/pdtc-2019-0002 L4 - http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=135645989&lang=hu&site=ehost-live L4 - http://www.degruyter.com/view/j/pdtc.2019.48.issue-1/pdtc-2019-0002/pdtc-2019-0002.xml KW - DIGITIZATION of archival materials KW - TECHNOLOGICAL innovations KW - WEB archiving KW - Archival Preservation KW - Community Archiving KW - DIGITAL libraries KW - Digital Preservation KW - DIGITAL technology KW - Personal Digital Archiving (PDA) ER - TY - ELEC TI - Netlab Web Archiving Course Brochure AU - Netlab T2 - Course Brochure DA - 2018/// PY - 2018 Y2 - 2019/01/28/ L1 - http://www.netlab.dk/wp-content/uploads/2017/04/NetLab-Web-Archiving-Course-Brochure.pdf ER - TY - ELEC TI - Netlab-courses AU - Netlab T2 - Netlab Course page DA - 2018/// PY - 2018 UR - http://www.netlab.dk/services/courses/ Y2 - 2019/01/28/ ER - TY - ELEC TI - IIPC Training Working Group survey AU - IIPC TWG T2 - IIPC Training Working Group Survey DA - 2017/// PY - 2017 UR - https://www.surveymonkey.com/r/V7MVXXW Y2 - 2018/06/12/ ER - TY - ELEC TI - IIPC portal AU - IIPC DA - 2019/// PY - 2019 Y2 - 2019/01/28/ ER - TY - ELEC TI - Shine Project Historical Research Prototype AU - WEB Archive UK AU - JISC AB - This tool has been developed as part of the Big UK Data Arts and Humanities project funded by the AHRC. Read more about the project on our blog. The data was acquired by JISC from the Internet Archive (IA) and includes all .uk websites in the IA web collection crawled between around 1996 and April 2013. The collection comprises over 3.5 billion items (urls, images and other documents) and has been full-text indexed by the UK Web Archive. Every word of every website in the collection can be searched for and analysed.
DA - 2015/// PY - 2015 UR - https://www.webarchive.org.uk/shine ER - TY - JOUR TI - Digital curation: the development of a discipline within information science AU - Higgins, Sarah T2 - Journal of Documentation AB - Digital curation addresses the technical, administrative and financial ecology required to ensure that digital information remains accessible and usable over the long term. The purpose of this paper is to trace digital curation’s disciplinary emergence and examine its position within the information sciences domain in terms of theoretical principles, using a case study of developments in the UK and the USA. DA - 2018/10/08/ PY - 2018 DO - 10.1108/JD-02-2018-0024 VL - 74 IS - 6 SP - 1318 EP - 1338 SN - 0022-0418 UR - https://www.emeraldinsight.com/doi/10.1108/JD-02-2018-0024 KW - Development KW - Digital curation KW - Education KW - History KW - Models KW - Professional associations ER - TY - ELEC TI - Scientific American: The Semantic Web DA - 2020/08/14/ PY - 2020 UR - http://web.archive.org/web/20070713230811/http://www.sciam.com/print_version.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21 Y2 - 2020/08/14/ L2 - http://web.archive.org/web/20070713230811/http:/www.sciam.com/print_version.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21 ER - TY - BOOK TI - Metadata AU - Zeng, Marcia AU - Qin, Jian DA - 2016/// PY - 2016 PB - Facet Publishing ER - TY - JOUR TI - Webes tartalmak digitális megőrzése AU - Drótos, László T2 - Könyv, könyvtár, könyvtáros DA - 2018/// PY - 2018 DP - Zotero VL - 27 IS - 10 SP - 11 EP - 17 J2 - 3K LA - hu UR - https://epa.oszk.hu/01300/01367/00307/pdf/EPA01367_3K_2018_10_011-017.pdf Y2 - 2020/08/15/ L1 - https://epa.oszk.hu/01300/01367/00307/pdf/EPA01367_3K_2018_10_011-017.pdf ER - TY - JOUR TI - Fair Use, Notice Failure, and the Limits of Copyright as Property AU - Liu, Joseph P.
T2 - Boston University Law Review AB - If we start with the assumption that copyright law creates a system of property rights, to what extent does this system give adequate notice to third parties regarding the scope of such rights, particularly given the prominent role played by the fair use doctrine? This essay argues that, although the fair use doctrine may provide adequate notice to sophisticated third parties, it fails to provide adequate notice to less sophisticated parties. Specifically, the fair use doctrine imposes nearly insuperable informational burdens upon the general public regarding the scope of the property entitlement and the corresponding duty to avoid infringement. Moreover, these burdens have only increased with changes in technology that enable more, and more varied, uses of copyrighted works. The traditional response to uncertainty in fair use has been to suggest ways of curing the notice failure by providing clearer rules about what is and is not permitted. This essay suggests, however, that these efforts to reinforce the property framework feel increasingly strained and fail to reflect how copyright law is actually experienced by the general public. Indeed, the extent of the notice failure is such that it may be time to stop treating copyright like a property right, at least for certain classes of users. The essay ends by suggesting a number of alternative frameworks that would seek to regulate public behavior regarding copyrighted works without imposing the unrealistic informational burdens required by a system of property rights. 
DA - 2016/05// PY - 2016 DP - EBSCOhost VL - 96 IS - 3 SP - 833 EP - 856 J2 - Boston University Law Review SN - 00068047 UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=118226656&lang=hu&site=ehost-live Y2 - 2020/08/17/06:58:54 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=118226656&S=R&D=a9h&EbscoContent=dGJyMMvl7ESep7U4v%2BvlOLCmsEieprBSs624TbGWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - COPYRIGHT infringement KW - COPYRIGHT notices KW - COPYRIGHT of digital media KW - FAIR use (Copyright) KW - PROPERTY rights ER - TY - JOUR TI - Increasing Copyright Protection for Social Media Users by Expanding Social Media Platforms' Rights AU - Wichtowski, Ryan T2 - Duke Law & Technology Review AB - Social media platforms allow users to share their creative works with the world. Users take great advantage of this functionality, as Facebook, Instagram, Flickr, Snapchat, and WhatsApp users alone uploaded 1.8 billion photos per day in 2014. Under the terms of service and terms of use agreements of most U.S. based social media platforms, users retain ownership of this content, since they only grant social media platforms nonexclusive licenses to their content. While nonexclusive licenses protect users vis-à-vis the social media platforms, these licenses preclude social media platforms from bringing copyright infringement claims on behalf of their users against infringers of user content under the Copyright Act of 1976. Since the average cost of litigating a copyright infringement case might be as high as two million dollars, the average social media user cannot protect his or her content against copyright infringers. To remedy this issue, Congress should amend 17 U.S.C. § 501 to allow social media platforms to bring copyright infringement claims against those who infringe their users' content. 
Through this amendment, Congress would create a new protection for social media users while ensuring that users retain ownership over the content they create. DA - 2017/00// PY - 2017 DP - EBSCOhost VL - 16 IS - 1 SP - 253 EP - 268 J2 - Duke Law & Technology Review SN - 23289600 UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=127588435&lang=hu&site=ehost-live Y2 - 2020/08/17/07:32:41 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=127588435&S=R&D=a9h&EbscoContent=dGJyMMvl7ESep7U4v%2BvlOLCmsEieprFSsai4SrWWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - COPYRIGHT -- United States KW - COPYRIGHT lawsuits KW - INTERNET users KW - MINDEN Pictures Inc. KW - SOCIAL media KW - UNITED States. Copyrights (1976) ER - TY - CONF TI - Itsy-Bitsy Spider: A Look at Web Crawlers and Web Archiving AU - Oliveira, Caroline C3 - Digital Preservation: CINE-GT 1807 DA - 2017/// PY - 2017 UR - https://www.nyu.edu/tisch/preservation/program/student_work/2017fall/17f_1807_Oliveira_a2a.pdf Y2 - 2020/08/17/07:39:46 L1 - https://www.nyu.edu/tisch/preservation/program/student_work/2017fall/17f_1807_Oliveira_a2a.pdf ER - TY - CHAP TI - Web Crawling AU - Liu, Bing AU - Menczer, Filippo T2 - Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data A2 - Liu, Bing T3 - Data-Centric Systems and Applications AB - Web crawlers, also known as spiders or robots, are programs that automatically download Web pages. Since information on the Web is scattered among billions of pages served by millions of servers around the globe, users who browse the Web can follow hyperlinks to access information, virtually moving from one page to the next. A crawler can visit many sites to collect information that can be analyzed and mined in a central location, either online (as it is downloaded) or off-line (after it is stored). 
CY - Berlin, Heidelberg DA - 2011/// PY - 2011 DP - Springer Link SP - 311 EP - 362 LA - en PB - Springer SN - 978-3-642-19460-3 UR - https://doi.org/10.1007/978-3-642-19460-3_8 Y2 - 2020/08/17/07:51:22 KW - Search Engine KW - Anchor Text KW - Cosine Similarity KW - Domain Name System KW - Priority Queue ER - TY - JOUR TI - Counter-archiving Facebook AU - Ben-David, Anat T2 - European Journal of Communication AB - The article proposes archival thinking as an analytical framework for studying Facebook. Following recent debates on data colonialism, it argues that Facebook dialectically assumes the role of a new archon of public records, while being unarchivable by design. It then puts forward counter-archiving – a practice developed to resist the epistemic hegemony of colonial archives – as a method that allows the critical study of the social media platform, after it had shut down researchers' access to public data through its application programming interface. After defining and justifying counter-archiving as a method for studying datafied platforms, two counter-archives are presented as proof of concept. The article concludes by discussing the shifting boundaries between the archivist, the activist and the scholar, as the imperative of research methods after datafication.
DA - 2020/06// PY - 2020 DO - 10.1177/0267323120922069 DP - EBSCOhost VL - 35 IS - 3 SP - 249 EP - 264 J2 - European Journal of Communication SN - 02673231 L1 - https://journals.sagepub.com/doi/pdf/10.1177/0267323120922069 L2 - http://search.ebscohost.com/login.aspx?direct=true&db=asn&AN=144258249&lang=hu&site=ehost-live&scope=cite KW - archive KW - SOCIAL media KW - APPLICATION program interfaces KW - Application programming interface KW - datafication KW - Facebook KW - FACEBOOK (Web resource) KW - methods KW - POLITICAL advertising KW - PUBLIC records ER - TY - ELEC TI - Save a UK website AU - British Library DA - 2020/// PY - 2020 UR - https://www.webarchive.org.uk/en/ukwa/info/nominate Y2 - 2020/08/17/10:29:21 L2 - https://www.webarchive.org.uk/en/ukwa/info/nominate ER - TY - CHAP TI - A webarchiválás oktatása = The education of web-archiving AU - Drótos, László AU - Németh, Márton T2 - NETWORKSHOP 2018 konferenciakiadvány AB - The article focuses on three main issues. First, it gives an overview of an online research seminar for PhD students and web-archiving professionals organized by the NETLAB Research group, Aarhus University, Denmark. Second, it introduces the recently established Education and Training Working Group of the IIPC consortium, together with a quick overview of a brief survey on best practices and future plans in web archiving education. Third, it describes a Hungarian web-archiving training concept: the training will be organized by the Library Institute for cultural heritage professionals of all kinds who want to acquire basic skills and competences in this field.
CY - Budapest DA - 2018/// PY - 2018 DP - real.mtak.hu SP - 31 EP - 37 LA - hu PB - HUNGARNET Egyesület UR - https://doi.org/10.31915/NWS.2018.4 Y2 - 2020/08/17/14:25:46 L1 - http://real.mtak.hu/64076/1/drotosnemeth_VEGLEGES.pdf L2 - http://real.mtak.hu/64076/ ER - TY - SLIDE TI - The education of web archiving -abstract T2 - Information Interactions conference A2 - Németh Márton, Drótos László AB - How to catalogue a web archive? Some solutions for metadata management at the web harvesting pilot project of National Széchényi Library, Hungary CY - Bratislava DA - 2018/// PY - 2018 LA - en UR - https://fphil.uniba.sk/fileadmin/fif/katedry_pracoviska/kkiv/Konferencie/Information_interactions_book_of_abstracts_2018.pdf Y2 - 2020/08/17/ L1 - https://fphil.uniba.sk/fileadmin/fif/katedry_pracoviska/kkiv/Konferencie/Information_interactions_book_of_abstracts_2018.pdf ER - TY - SLIDE TI - The education of web archiving T2 - CDA 2018 conference A2 - Németh, Márton CY - Bratislava DA - 2018/// PY - 2018 UR - http://mekosztaly.oszk.hu/mia/doc/The_education_of_web-archiving_CDA_2018.pptx Y2 - 2020/08/17/ ER - TY - SLIDE TI - HOW TO DIG INTO THE HISTORY OF A NATION'S WEB? THE DEVELOPMENT OF THE DANISH WEB 2006-2015 T2 - IIPC RWG webinar series A2 - Brügger, Niels CY - online webinar DA - 2018/// PY - 2018 LA - en ER - TY - JOUR TI - Webarchiválás két szakmai rendezvény tükrében AU - Németh, Márton T2 - Könyv, könyvtár, könyvtáros DA - 2019/// PY - 2019 DP - Zotero VL - 28 IS - 6 SP - 26 EP - 29 J2 - 3K LA - hu L1 - https://epa.oszk.hu/01300/01367/00316/pdf/EPA01367_3K_2019_06_026-029.pdf ER - TY - BLOG TI - Training materials AU - IIPC Training Working Group T2 - IIPC AB - ... 
DA - 2020/// PY - 2020 LA - en-GB UR - http://netpreserve.org/web-archiving/training-materials/ Y2 - 2020/08/17/15:10:26 L2 - https://netpreserve.org/web-archiving/training-materials/ ER - TY - SLIDE TI - A MIA pilot rövid bemutatása T2 - “404 Not Found” workshop A2 - Kampis, György CY - Budapest DA - 2017/// PY - 2017 LA - hu-HU UR - https://webarchivum.oszk.hu/wp-content/uploads/2020/03/Kampis_Gyorgy_MIA_pilot_GK_ea.pptx Y2 - 2020/08/17/15:18:30 L2 - https://webarchivum.oszk.hu/szakembereknek/404-not-found-workshop/404-not-found-workshop-2017-oktober-13/ ER - TY - SLIDE TI - How to catalogue a web archive? T2 - Information Interactions 2018 workshop A2 - Németh, Márton CY - Bratislava DA - 2018/// PY - 2018 LA - English UR - http://mekosztaly.oszk.hu/mia/doc/How_to_catalogue_a_web_archive.pptx ER - TY - JOUR TI - Rákóczi-archívum AU - Visky, Ákos László AU - Drótos, László T2 - Könyv, könyvtár, könyvtáros DA - 2020/// PY - 2020 VL - 29 IS - 3 SP - 35 EP - 48 J2 - 3K LA - magyar UR - https://epa.oszk.hu/01300/01367/00326/pdf/EPA01367_3K_2020_03_035-048.pdf Y2 - 2020/08/17/15:55:25 L1 - https://epa.oszk.hu/01300/01367/00326/pdf/EPA01367_3K_2020_03_035-048.pdf ER - TY - BLOG TI - IIPC Content Development Working Group T2 - IIPC AB - ... DA - 2019/// PY - 2019 LA - en-GB UR - http://netpreserve.org/about-us/working-groups/content-development-working-group/ Y2 - 2020/08/17/16:02:07 L2 - https://netpreserve.org/about-us/working-groups/content-development-working-group/ ER - TY - JOUR TI - Az OSZK Webarchívum új honlapjának felépítése és szolgáltatásai AU - Németh, Márton T2 - Könyv, könyvtár, könyvtáros DA - 2020/// PY - 2020 DP - Zotero VL - 29. IS - 6.
SP - 16 EP - 26 J2 - 3K LA - hu UR - https://epa.oszk.hu/01300/01367/00329/pdf/EPA01367_3K_2020_06_016-026.pdf Y2 - 2020/08/18/ L1 - https://epa.oszk.hu/01300/01367/00329/pdf/EPA01367_3K_2020_06_016-026.pdf ER - TY - JOUR TI - Gyorsmérleg – az OSZK Webarchívumának és néhány könyvtárnak a KDS-K pályázat keretében történt együttműködéséről AU - Visky, Ákos László T2 - Könyv, könyvtár, könyvtáros AB - Now that the cooperation realised within the library branch of the Public Collections Digitisation Strategy (KDS-K) has come to an end – a cooperation in which the Web Archive of the National Széchényi Library (OSZK) and the county-level city libraries participating as winners of the call worked together primarily on exploring websites of regional relevance within the national web space – it is fitting to give a short summary of the results of the joint work. DA - 2020/// PY - 2020 VL - 29. IS - 7-8. LA - magyar ER - TY - CHAP TI - Web Archives as a Research Subject AU - Németh, Márton T2 - Information and technology transforming lives: connection, interaction, innovation proceedings AB - Initial discussions about web archiving mainly focus on the archiving workflow, the technical details of the archiving process and several curatorial tasks that appear in this context. A good overview of these basic topics is offered in the preface of Brügger and Schroeder (2017), and an overview of this perspective was also given at the 2018 BOBCATSSS Conference (Drótos and Németh, 2018). In this paper we would like to highlight another major perspective of web archiving: the web-archived material can be defined as a major research subject in itself. Our aim in this article is to offer an overview of several new disciplinary research frameworks and perspectives based on the study of web archives.
Librarians, archivists, information scientists, Digital Humanities professionals, data scientists and IT developers can work together on analysing large archived web corpora, focusing on various structural and content-based features. New scientific disciplines, such as web history, have emerged through these research activities in the past ten years (Brügger, 2009). The major focus here is to gain new perspectives and outcomes from a historical viewpoint by analysing selected segments of the archived web content. The history of the web itself is also a relevant research topic (Brügger, 2016; Brügger et al., 2017). A new type of collaboration is emerging between historians and data scientists, combining quantitative data science analysis, data visualization and other data research methods and tools with qualitative historical research activities (Ben-David, Amram and Bekkerman, 2018; Brügger, 2013; Brügger and Schroeder, 2017; Ogden, Halford and Carr, 2017). Furthermore, a web archive itself appears as a large dataset and, as a whole, a major target of research projects in the field of big data analysis (Lnenicka, Hovad and Komarkova, 2015; Maemura, Becker and Milligan, 2016). Closely related to data-science-based research activities, a major challenge in the information retrieval field is the application of semantic web tools to ensure effective retrieval of the archived materials based on the meaning of the information found in a web corpus (Demidova et al., 2014; Fafalios et al., 2018; Fafalios, Kasturia and Nejdl, 2018; Gossen, Demidova and Risse, 2016; Souza et al., 2015).
CY - Osijek DA - 2019/// PY - 2019 DP - Open WorldCat SP - 471 EP - 478 LA - en PB - Faculty of humanities and social sciences, University of Osijek SN - 978-953-314-121-3 UR - http://bobcatsss2019.ffos.hr/docs/bobcatsss_proceedings.pdf Y2 - 2020/08/19/08:23:05 L1 - http://bobcatsss2019.ffos.hr/docs/bobcatsss_proceedings.pdf ER - TY - JOUR TI - Metadata Management and Future Plans to Generate Linked Open Data in the Hungarian Web Archiving Pilot Project AU - Németh, Márton AU - Drótos, László T2 - ITLIB AB - In this article we offer a short overview of the metadata management model of our web archiving pilot project, together with the international recommendations that form the background of the modelling. It includes an outline of the scope of metadata management (archive-level and website-level), an overview of major metadata types and a description of some major metadata fields (more than one hundred fields are available). Metadata-based full-text search and retrieval capabilities are also described in the article. The second chapter of the article points out that the absence of efficient and meaningful methods for exploring the archived content is a major hurdle on the way to turning web archives into a usable and useful information resource. A major challenge for information science is the adaptation of semantic web tools and methods to web archive environments. Web archives must become part of the linked data universe, with advanced query and integration capabilities, and must be directly exploitable by other systems and tools. We describe some basic considerations for successfully managing this semantic web integration process as a plan for the future.
DA - 2019/// PY - 2019 VL - 2019 IS - 2 LA - English UR - https://itlib.cvtisr.sk/buxus/docs/38-metadata.pdf Y2 - 2020/08/19/14:31:33 L1 - https://itlib.cvtisr.sk/buxus/docs/38-metadata.pdf ER - TY - JOUR TI - IFLA könyvtári referenciamodell AU - Riva, Pat AU - Boeuf, Patrick Le AU - Žumer, Maja AU - Szerkesztőbizottsága, Egységesítési DA - 2017/// PY - 2017 DP - Zotero SP - 99 LA - hu L1 - https://www.ifla.org/files/assets/cataloguing/frbrrg/ifla_lrm_2017_hun_v3.pdf ER - TY - CHAP TI - A Reference Model for a Trusted Service Guaranteeing Web-content AU - Togan, Mihai AU - Florea, Ionut T2 - ISSE 2015 A2 - Reimer, Helmut A2 - Pohlmann, Norbert A2 - Schneider, Wolfgang CY - Wiesbaden DA - 2015/// PY - 2015 DP - DOI.org (Crossref) SP - 216 EP - 224 LA - en PB - Springer Fachmedien Wiesbaden SN - 978-3-658-10933-2 978-3-658-10934-9 UR - http://link.springer.com/10.1007/978-3-658-10934-9_18 Y2 - 2020/08/20/07:42:44 ER - TY - SLIDE TI - Trust and the Web: Can the Audit Checklist be Applied to Web Archives? A2 - Clifton, Gerard DA - 2016/// PY - 2016 UR - https://openresearch-repository.anu.edu.au/bitstream/1885/47021/5/clifton.pdf Y2 - 2020/08/20/07:44:34 L1 - https://openresearch-repository.anu.edu.au/bitstream/1885/47021/5/clifton.pdf ER - TY - JOUR TI - Supporting Web Archiving via Web Packaging AU - Alam, Sawood AU - Weigle, Michele C AU - Nelson, Michael L AU - Klein, Martin AB - We describe challenges related to web archiving, replaying archived web resources, and verifying their authenticity. We show that Web Packaging has significant potential to help address these challenges and identify areas in which changes are needed in order to fully realize that potential. DA - 2019/// PY - 2019 DP - Zotero SP - 3 LA - en L1 - https://www.iab.org/wp-content/IAB-uploads/2019/06/sawood-alam-2.pdf ER - TY - RPRT TI - Digital Preservation and Authentic Legal Information AU - Flanagan, G. 
Patrick AB - Writing and researching about the permanence of digital documents is a quizzical, self-referential activity. I set out to uncover approaches to the problems facing the longevity of authentic legal information. How did I do this? Primarily, I accessed and read electronic documents. My exercise here might very well suffer the same issues raised in the information science and legal literature. Faulty, inconsistent, and potentially inauthentic electronic databases may unduly – however subtly – shade my analysis. For what I’m doing here – a student’s attempt to add to an academic discourse – I’m pretty unconcerned CY - Rochester, NY DA - 2010/// PY - 2010 DP - papers.ssrn.com LA - en M3 - SSRN Scholarly Paper PB - Social Science Research Network SN - ID 2463288 UR - https://papers.ssrn.com/abstract=2463288 Y2 - 2020/08/20/08:20:14 L1 - https://pdfs.semanticscholar.org/b89c/f4c1fdcfb70d8bd5d5f6f23ad0547154ffc5.pdf L2 - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2463288 KW - digital preservation KW - convergence KW - legal information KW - obsolescence ER - TY - JOUR TI - Accountability and accessibility: ensuring the evidence of e‐governance in Australia AU - Cunningham, Adrian AU - Phillips, Margaret T2 - Aslib Proceedings A2 - Auty, Caroline DA - 2005/08// PY - 2005 DO - 10.1108/00012530510612059 DP - DOI.org (Crossref) VL - 57 IS - 4 SP - 301 EP - 317 J2 - AP LA - en SN - 0001-253X ST - Accountability and accessibility UR - https://www.emerald.com/insight/content/doi/10.1108/00012530510612059/full/html Y2 - 2020/08/20/08:21:34 ER - TY - JOUR TI - Web Archives: The Future(s) AU - Meyer, Eric T. 
AU - Thomas, Arthur AU - Schroeder, Ralph T2 - SSRN Electronic Journal DA - 2011/// PY - 2011 DO - 10.2139/ssrn.1830025 DP - DOI.org (Crossref) J2 - SSRN Journal LA - en SN - 1556-5068 ST - Web Archives UR - http://www.ssrn.com/abstract=1830025 Y2 - 2020/08/20/09:26:51 L1 - https://digital.library.unt.edu/ark:/67531/metadc1457753/m2/1/high_res_d/2011_06_IIPC_WebArchives-TheFutures.pdf ER - TY - JOUR TI - Challenges for the national, regional and thematic Web Archiving and their Use AU - Risse, Thomas AU - Nejdl, Wolfgang T2 - Zeitschrift für Bibliothekswesen und Bibliographie AB - The World Wide Web is well established as a global information and communication medium. New technologies regularly come along which expand the forms of use and permit even inexperienced users to publish content or take part in discussions. For this reason the Web can also be seen as a good documenter of present-day society. The dynamism of the Web means that its content is, by its very nature, transitory, and new technologies and forms of use regularly present new challenges for the collection of web content for web archiving. Static pages still dominated in the early days of web archiving, whereas many dynamic types of content have now arisen which integrate information from different sources. There is now growing interest from various research disciplines in conventional domain-oriented web harvesting, in thematic web collections and in their use and exploration. This article examines a number of challenges and possible methods of collecting thematic and dynamic content from the Web and social media. Current problems which have arisen in academic use are discussed, and it is shown how web archives and other temporal collections can be searched more effectively. 
DA - 2015/// PY - 2015 DP - ResearchGate VL - 62 SP - 160 EP - 171 J2 - Zeitschrift für Bibliothekswesen und Bibliographie L1 - https://www.researchgate.net/profile/Thomas_Risse/publication/286867733_Challenges_for_the_national_regional_and_thematic_Web_Archiving_and_their_Use/links/58998d89aca2721f0db0baa5/Challenges-for-the-national-regional-and-thematic-Web-Archiving-and-their-Use.pdf L4 - https://www.researchgate.net/publication/286867733_Challenges_for_the_national_regional_and_thematic_Web_Archiving_and_their_Use ER - TY - ELEC TI - International Internet Preservation Consortium (IIPC) repository T2 - UNT Digital Library AB - The mission of the IIPC is to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations. DA - 2020/// PY - 2020 LA - en UR - https://digital.library.unt.edu/explore/partners/IIPC/ Y2 - 2020/08/20/09:33:01 L2 - https://digital.library.unt.edu/explore/partners/IIPC/ L2 - https://digital.library.unt.edu/explore/partners/IIPC/#explore ER - TY - JOUR TI - The Paper of Record Meets an Ephemeral Web: An Examination of Linkrot and Content Drift within The New York Times AU - Zittrain, Jonathan AU - Bowers, John AU - Stanton, Clare AB - Hyperlinks are a powerful tool for journalists and their readers. Diving deep into the context of an article is just a click away. But hyperlinks are a double-edged sword; for all of the internet’s boundlessness, what’s found on the web can also be modified, moved, or entirely disappeared. This often-irreversible decay of web content is commonly known as linkrot. It comes with a similar problem of content drift, or the often-unannounced changes––retractions, additions, replacement––to the content at a particular URL. Our team of researchers at Harvard Law School has undertaken a project to gain insight into the extent and characteristics of journalistic linkrot and content drift.
We examined hyperlinks in New York Times articles starting with the launch of the Times website in 1996 up through mid-2019, working from a dataset provided to us by the Times. We focus on the Times not only because it is an influential publication whose archives are often used to help form a historical record, but also because the substantial linkrot and content drift we find across the New York Times corpus reflect the inherent difficulties of long-term linking to pieces of a volatile web. Results show a near linear increase of linkrot over time, with interesting patterns emerging within certain sections of the paper or across top-level domains. Over half of the articles containing at least one URL also contained a dead link. Additionally, of the ostensibly “healthy” links existing in articles, a hand review revealed additional erosion to citations via content drift. DA - 2021/05/26/ PY - 2021 DO - 10.2139/ssrn.3833133 LA - English UR - https://ssrn.com/abstract=3833133 Y2 - 2021/05/26/ ER - TY - JOUR TI - Web-archiving and social media: an exploratory analysis AU - Vlassenroot, Eveline AU - Chambers, Sally AU - Lieber, Sven AU - Michel, Alejandra AU - Geeraert, Friedel AU - Pranger, Jessica AU - Birkholz, Julie AU - Mechant, Peter T2 - International Journal of Digital Humanities AB - The archived web provides an important footprint of the past, documenting online social behaviour through social media, and news through media outlets’ websites and government sites. Consequently, web archiving is increasingly gaining the attention of heritage institutions, academics and policy makers. The importance of web archives as data resources for (digital) scholars has been acknowledged for investigating the past. Still, heritage institutions and academics struggle to ‘keep up to pace’ with the fast evolving changes of the World Wide Web and with the changing habits and practices of internet users.
While a number of national institutions have set up a national framework to archive ‘regular’ web pages, social media archiving (SMA) is still in its infancy with various countries starting up pilot archiving projects. SMA is not without challenges; the sheer volume of social media content, the lack of technical standards for capturing or storing social media data and social media’s ephemeral character can be impeding factors. The goal of this article is three-fold. First, we aim to extend the most recent descriptive state-of-the-art of national web archiving, published in the first issue of International Journal of Digital Humanities (March 2019) with information on SMA. Secondly, we outline the current legal, technical and operational (such as the selection and preservation policy) aspects of archiving social media content. This is complemented with results from an online survey to which 15 institutions responded. Finally, we discuss and reflect on important challenges in SMA that should be considered in future archiving projects. DA - 2021/06/22/ PY - 2021 DO - 10.1007/s42803-021-00036-1 DP - Springer Link J2 - Int J Digit Humanities LA - en SN - 2524-7840 ST - Web-archiving and social media UR - https://doi.org/10.1007/s42803-021-00036-1 Y2 - 2021/07/15/07:35:28 L1 - https://link.springer.com/content/pdf/10.1007%2Fs42803-021-00036-1.pdf ER - TY - JOUR TI - Web archiving of indigenous knowledge systems in South Africa AU - Balogun, Tolulope AU - Kalusopa, Trywell T2 - Information Development AB - The purpose of the paper was to highlight the digitization of Indigenous Knowledge Systems (IKS) in institutional repositories in South Africa with a view to develop a framework for Web archiving IKS-related websites in South Africa. Anchored on the interpretivist paradigm, the qualitative research method was adopted for this research. The multiple case study research strategy was considered appropriate for the study. 
Data was gathered through face-to-face in-depth interviews and content analysis. Interviews were conducted with eight IKS staff at the IKS Documentation Centres across four provinces in South Africa. The study revealed that although there are efforts to digitize IKS and make them accessible through some channels online, there are no specific digital preservation policies guiding the project. Apart from the fact that there are no policies in place to support any Web archiving initiative, the concept of Web archiving was generally unfamiliar to the respondents. The respondents, while admitting to the lack of a standard policy guiding the digitization project, also admitted to a lack of knowledge or in-depth understanding of Web archiving and its prospects as a digital preservation measure. The research, therefore, proposes a Web archiving framework that should be incorporated into the digital preservation policy framework. This research will be useful to policymakers and all stakeholders in South Africa and other parts of Africa. DA - 2021/04/08/ PY - 2021 DO - 10.1177/02666669211005522 DP - SAGE Journals SP - 02666669211005522 J2 - Information Development LA - en SN - 0266-6669 UR - https://doi.org/10.1177/02666669211005522 Y2 - 2021/07/15/07:36:15 KW - web archiving KW - archives KW - indigenous knowledge systems KW - institutional repositories KW - South Africa ER - TY - JOUR TI - Design of an Enhanced Web Archiving System for Preserving Content Integrity with Blockchain AU - Hwang, Hyun Cheon AU - Shon, Jin Gon AU - Park, Ji Su T2 - Electronics AB - A Web archive system is a traditional means of preserving web content for the future, and its importance is growing with the explosive growth of web content. The reference model for an open archival information system (OAIS) has long provided guidance for long-term archiving systems, and most organizations that archive web content follow this guidance.
In addition, the web archive (WARC) ISO standard governs web content archiving. However, there is no way to secure content integrity, and it is hard to identify the original. Because of these limitations, a web archive system is vulnerable to disputes over content integrity. In this paper, we propose the blockchain-linked (BCLinked) web archiving system, which uses blockchain technology and an extended WARC field to store web content integrity metadata in a blockchain. We designed the BCLinked web archiving system and confirmed through experiments that the proposed system secures content integrity. DA - 2020/08// PY - 2020 DO - 10.3390/electronics9081255 DP - www.mdpi.com VL - 9 IS - 8 SP - 1255 LA - en UR - https://www.mdpi.com/2079-9292/9/8/1255 Y2 - 2021/07/15/07:36:56 L1 - https://www.mdpi.com/2079-9292/9/8/1255/pdf L2 - https://www.mdpi.com/2079-9292/9/8/1255 KW - WARC KW - web archive KW - web crawling KW - BCLinked KW - blockchain KW - web archiving system ER - TY - CONF TI - Dynamic Classification in Web Archiving Collections AU - Patel, Krutarth AU - Caragea, Cornelia AU - Phillips, Mark T2 - LREC 2020 AB - Web archived data usually contains high-quality documents that are very useful for creating specialized collections of documents. To create such collections, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection out of the large collections (of millions in size) from Web Archiving institutions. However, the patterns of the documents of interest can differ substantially from one document to another, which makes the automatic classification task very challenging. In this paper, we explore dynamic fusion models to find, on the fly, the model or combination of models that performs best on a variety of document types. Our experimental results show that the approach that fuses different models outperforms individual models and other ensemble methods on three datasets.
C1 - Marseille, France C3 - Proceedings of the 12th Language Resources and Evaluation Conference DA - 2020/05// PY - 2020 DP - ACLWeb SP - 1459 EP - 1468 LA - English PB - European Language Resources Association SN - 979-10-95546-34-4 UR - https://aclanthology.org/2020.lrec-1.182 Y2 - 2021/07/15/07:37:39 L1 - https://aclanthology.org/2020.lrec-1.182.pdf ER - TY - CHAP TI - Web Archiving in Singapore: The Realities of National Web Archiving AU - Lee, Ivy Huey Shin AU - Tay, Shereen T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - This chapter describes the challenges of web archiving in Singapore, where awareness about web archiving is low. It focuses on the legislation amendments, infrastructure enhancements and public engagement activities required to facilitate web archiving work at a national scale. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 33 EP - 42 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 ST - Web Archiving in Singapore UR - https://doi.org/10.1007/978-3-030-63291-5_4 Y2 - 2021/07/15/07:38:29 ER - TY - THES TI - Saving the Web: Facets of Web Archiving in Everyday Practice AU - Ogden, Jessica Rose AB - This thesis makes visible the work of archiving the Web. It demonstrates the growing role of web archives (WAs) in the circulation of information and culture online, and emphasises the inherent connections between how the Web is archived, its future use and our understandings of WAs, archivists and the Web itself. As the first in-depth sociotechnical study of web archiving, this research offers a view into the ways that web archivists are shaping what and how the Web is saved for the future. 
Using a combination of ethnographic observation, interviews and documentary sources, the thesis investigates web archiving at three sites: the Internet Archive – the world’s largest web archive; Archive Team – ‘a loose collective of rogue archivists and programmers’ archiving the Web; and the Environmental Data & Governance Initiative (EDGI) – a community of academics, librarians and activists formed in the wake of the 2016 US Presidential Election to safeguard environmental and climate data. Through the application of practice theory, thematic analysis and facet methodology, I frame my findings through three ‘facets of web archiving’: infrastructure, culture and politics. I show that the web archival activities of organisations, people and bots are both historically-situated and embedded in the contemporary politics of online communication and information sharing. WAs are reflected on as ‘places’ where the past, present and future of the Web collapses around an evolving assemblage of sociotechnical practices and actors dedicated to enabling different (and at times, conflicting) community-defined imaginaries for the Web. WAs are revealed to be contested sites where these politics are enabled and enacted over time.
This thesis therefore contributes to research on the performance of power and politics on the Web, and raises new questions concerning how different communities negotiate the challenges of ephemerality and strive to build the ‘Web they want’. DA - 2020/08// PY - 2020 DP - eprints.soton.ac.uk SP - 260 LA - en M3 - phd PB - University of Southampton ST - Saving the Web UR - https://eprints.soton.ac.uk/447624/ Y2 - 2021/07/15/07:40:00 L1 - https://eprints.soton.ac.uk/447624/1/JOgden_PhD_Thesis_Final.pdf L2 - https://eprints.soton.ac.uk/447624/ ER - TY - CONF TI - International Initiatives and Advances in Brazil for Government Web Archiving AU - Melo, Jonas Ferrigolo AU - Rockembach, Moisés A2 - Bisset Álvarez, Edgar T3 - Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering AB - This study aimed to illustrate some government web archiving initiatives in several countries and establish an overview of the Brazilian scenario with regard to the preservation of content published on government websites. In Brazil, although there is a robust set of laws that require the State to manage, provide access to and preserve its documents and information, there is still no policy for the preservation of web content. The result is the erasure and permanent loss of government information produced exclusively through websites. There are several government initiatives for web archiving around the world, which can serve as examples for the implementation of a Brazilian policy. It is concluded that the long-term maintenance of governmental information available on the web is fundamental for public debate and for monitoring governmental actions. To ensure the preservation of this content, the country must define its policy for the preservation of documents produced in a web environment.
C1 - Cham C3 - Data and Information in Online Environments DA - 2021/// PY - 2021 DO - 10.1007/978-3-030-77417-2_6 DP - Springer Link SP - 83 EP - 95 LA - en PB - Springer International Publishing SN - 978-3-030-77417-2 KW - Web archiving KW - Digital preservation KW - Websites KW - Government web archiving ER - TY - CONF TI - A Web Archiving Method for Preserving Content Integrity by Using Blockchain AU - Hwang, Hyun Cheon AU - Park, Ji Su AU - Lee, Byung Rae AU - Shon, Jin Gon A2 - Park, James J. A2 - Fong, Simon James A2 - Pan, Yi A2 - Sung, Yunsick T3 - Lecture Notes in Electrical Engineering AB - With the explosive growth of web data, web archive systems have become essential for preserving historical information for future generations. The reference model for an Open Archival Information System (OAIS) has provided an excellent guide for long-term archiving systems, and most web archive systems follow this guide. However, there is still a weak point in terms of content integrity, because archived web data could be altered in an unauthorized manner. In this paper, we propose the BCLinked (Blockchain Linked) web archiving method, which uses blockchain technology and an extended WARC (Web ARChive) file format to ensure content integrity. We confirmed through experiments that the proposed method ensures content integrity. C1 - Singapore C3 - Advances in Computer Science and Ubiquitous Computing DA - 2021/// PY - 2021 DO - 10.1007/978-981-15-9343-7_47 DP - Springer Link SP - 341 EP - 347 LA - en PB - Springer SN - 9789811593437 KW - WARC KW - OAIS KW - Web archive KW - BCLinked web archiving method KW - Blockchain ER - TY - JOUR TI - A Supplementary Tool for Web-archiving Using Blockchain Technology AU - de Villiers, John E. AU - Calitz, André P.
T2 - The African Journal of Information and Communication DA - 2020/// PY - 2020 DO - 10.23962/10539/29194 DP - SciELO VL - 25 SP - 1 EP - 14 SN - 2077-7213 UR - http://www.scielo.org.za/scielo.php?script=sci_abstract&pid=S2077-72132020000100003&lng=en&nrm=iso&tlng=en Y2 - 2021/07/15/07:42:19 L1 - http://www.scielo.org.za/pdf/ajic/v25/03.pdf L2 - http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2077-72132020000100003 ER - TY - CONF TI - Web Archiving and Digital Libraries AU - Xie, Zhiwu AU - Klein, Martin AU - Fox, Edward A. T3 - JCDL '20 AB - This workshop will explore integration of Web archiving and digital libraries and cover all stages of its complete life cycle, including creation/authoring, uploading/publishing, crawling, indexing, exploration, and archiving, etc. It will include particular coverage of current topics of interest, like: big data, social media archiving, and systems. C1 - New York, NY, USA C3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 DA - 2020/00/01/ PY - 2020 DO - 10.1145/3383583.3398509 DP - ACM Digital Library SP - 583 EP - 584 PB - Association for Computing Machinery SN - 978-1-4503-7585-6 UR - https://doi.org/10.1145/3383583.3398509 Y2 - 2021/07/15/ L1 - https://vtechworks.lib.vt.edu/bitstream/handle/10919/98566/WADL_2020.pdf?sequence=1 L1 - https://dl.acm.org/doi/pdf/10.1145/3383583.3398509 L1 - https://vtechworks.lib.vt.edu/bitstream/10919/98566/1/WADL_2020.pdf L2 - https://dl.acm.org/doi/abs/10.1145/3383583.3398509 KW - web archiving KW - digital preservation KW - community building ER - TY - JOUR TI - Web archiving of the Covid crisis in Europe: Close reading's challenges AU - Schafer, Valerie DA - 2021/07/02/ PY - 2021 DP - orbilu.uni.lu LA - en ST - Web archiving of the COVID crisis in Europe UR - https://orbilu.uni.lu/handle/10993/47559 Y2 - 2021/07/15/07:45:16 L2 - https://orbilu.uni.lu/handle/10993/47559 ER - TY - JOUR TI - Web Archiving
as Culture: Tumblr and The Cultural Construction of the Archived Web AU - Ogden, Jessica T2 - AoIR Selected Papers of Internet Research AB - Web archives - broadly conceived as any attempt to capture and preserve the Web for future use - are ever more central to discussions of digital access in the public sphere, as they provide tools for accessing parts of the Web that have been subject to neglect, removal or state and platform-based forms of content moderation and censorship. In this paper I discuss the cultural significance of web archiving through the example of Tumblr’s 2018 efforts to remove so-called ‘Not Safe for Work’ (NSFW) posts from the platform. The paper examines the archiving of Tumblr NSFW by Archive Team, a self-described ‘loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage’. Findings are presented through the concept of culture, which provides a dual lens through which to understand web archiving practices as contingent upon the cultural worlds which they create and operate within. Here, web archiving as culture reveals the ways that practices shape (and are shaped by) online community membership, the nature of how and why the Web is archived and the reflexive significance participants place on their own web archival activities. The paper contributes to broader discussions of online community formation and raises further questions about the ethics and role of power in the production of web archives, as well as their positioning as historical representations of online cultures.
DA - 2020/10/05/ PY - 2020 DO - 10.5210/spir.v2020i0.11294 DP - www.spir.aoir.org LA - en SN - 2162-3317 ST - WEB ARCHIVING AS CULTURE UR - https://www.spir.aoir.org/ojs/index.php/spir/article/view/11294 Y2 - 2021/07/15/07:46:03 KW - web archives KW - archival practice KW - culture KW - NSFW KW - Tumblr ER - TY - JOUR TI - How to Catch a Digital Speed Goat: A Web Archiving Case Study at the University of Wyoming AU - Davis, Sara AU - Gattermeyer, Rachel T2 - Provenance, Journal of the Society of Georgia Archivists DA - 2021/01/01/ PY - 2021 VL - 37 IS - 1 SN - 0739-4241 ST - How to Catch a Digital Speed Goat UR - https://digitalcommons.kennesaw.edu/provenance/vol37/iss1/4 L1 - https://digitalcommons.kennesaw.edu/cgi/viewcontent.cgi?article=1534&context=provenance L2 - https://digitalcommons.kennesaw.edu/provenance/vol37/iss1/4/ ER - TY - JOUR TI - APPLICATION OF WEB ARCHIVING TECHNOLOGIES IN BNL AND NAB: A PROPOSED MODEL AU - Mahmud, Rifat AU - Reza, Raiyan Bin AB - Web archiving has become a regular activity in many libraries and archival institutions. With the massive spread of the internet, preserving web content is now being given importance by various countries, as websites contain important legal, political and educational information. This paper investigates the issues related to web archiving that might be faced by the Bangladesh National Library (BNL) and the National Archives of Bangladesh (NAB). The main aim of this paper is to describe the current state of web archiving in BNL and NAB. The paper also tries to identify the problems in archiving web content and to provide possible solutions to overcome them. The interview method was applied for this study. We interviewed officials from both BNL and NAB and explored relevant literature to gather information for our work.
Web archiving activities have proved useful in many government libraries and archival institutions around the globe, but they are yet to be implemented in BNL and NAB. The study found many challenges to implementing web archiving in BNL and NAB, such as technological difficulties, copyright issues, unskilled manpower and lack of logistical support, which should be taken into account when implementing any web archiving programme. Sufficient steps like proper planning, efficient training, logistical support, international cooperation and adequate financial support will help the authorities to establish a successful web archiving programme. Finally, we propose an intuitive model for NAB and BNL that could be considered when taking any web archiving initiative. DA - 2020/// PY - 2020 DP - Zotero VL - 25 SP - 17 LA - en L1 - https://lab.org.bd/wp-content/uploads/2020/12/V25_N2_06_Rifat.pdf ER - TY - JOUR TI - Citizen Web Archiving: Empowering Undergraduates to Preserve the Internet AU - Harris, Kayla AU - Shreffler, Stephanie AU - Beis, Christina A T2 - Marian Library Faculty Presentations DA - 2021/// PY - 2021 DP - Zotero IS - 24 SP - 2 LA - en UR - https://ecommons.udayton.edu/imri_faculty_presentations/24?utm_source=ecommons.udayton.edu%2Fimri_faculty_presentations%2F24&utm_medium=PDF&utm_campaign=PDFCoverPages Y2 - 2021/08/06/ L1 - https://ecommons.udayton.edu/cgi/viewcontent.cgi?article=1023&context=imri_faculty_presentations ER - TY - JOUR TI - Mapping of audiences for academic web archiving initiatives AU - Martins, Marina Rodrigues AU - Rockembach, Moisés T2 - Intercom: Revista Brasileira de Ciências da Comunicação AB - The study presents the potential network of strategic publics of the Universidade Federal do Rio Grande do Sul, with a view to promoting web archiving initiatives in the academic sphere.
It took into account the relational environment projected from the Senior Administration Bodies and the Graduate Program in Communication of the University's Faculty of Library Science and Communication. As references, the initiatives implemented and the organisational structures of Columbia University and Harvard University were observed. The methodology encompassed bibliographic, documentary and content research. The understanding of publics was based on the approaches of logical conceptualisation, power and communication. The study concluded that organisational actors exert influence at different levels, each according to their responsibilities. The larger the number of archived collections, the more complex the networks of publics involved, whose different subjects depend on financial, infrastructural, technological, legal and other support. DA - 2020/04/27/ PY - 2020 DO - 10.1590/1809-5844202014 DP - SciELO VL - 43 SP - 71 EP - 88 J2 - Intercom, Rev. Bras. Ciênc. Comun. LA - en SN - 1809-5844, 1980-3508 UR - http://www.scielo.br/j/interc/a/TCW4JZ5Y7PvWfYCYjWkbkbz/abstract/?lang=en Y2 - 2021/07/15/07:49:23 L1 - http://www.scielo.br/j/interc/a/TCW4JZ5Y7PvWfYCYjWkbkbz/?lang=en&format=pdf L2 - https://www.scielo.br/j/interc/a/TCW4JZ5Y7PvWfYCYjWkbkbz/abstract/?lang=en KW - Web archiving KW - Web archive KW - Mapping audiences KW - Public profile KW - Public Relations ER - TY - JOUR TI - Web archiving: Policy and practice AU - Maches, Tori AU - Christensen, Marlayna T2 - Journal of Digital Media Management AB - The UC San Diego Library has been collecting and providing access to archived web content since 2007. Initial collections were created on an ad hoc basis, with no high-level plan to identify websites and content of interest, and there was little documentation of how early collection decisions were made.
As time passed, the library’s web archiving efforts increased in scale, and outgrew this informal approach. Efforts were made to standardise web archiving processes and policies via collection request forms and standardised metadata, eventually culminating in the creation of a web archive collection development policy, and collection and quality control workflows and tracking. This article outlines the process of creating these tools, including establishing institutional needs and concerns, evaluating the wider landscape of web archiving policies and norms, and considering sustainable use of available resources. The article also discusses future areas of work to ensure that web content of research and historical interest is captured in full, preserved responsibly, and made accessible even when the original websites have changed or disappeared. DA - 2020/01/01/ PY - 2020 DP - IngentaConnect VL - 8 IS - 3 SP - 201 EP - 214 J2 - Journal of Digital Media Management ST - Web archiving KW - web archives KW - policy KW - collection development KW - process KW - workflows ER - TY - JOUR TI - Proceedings of the 2020 Web Archiving and Digital Libraries Workshop AU - Xie, Zhiwu AU - Klein, Martin AU - Fox, Edward A. 
AB - "TMVis: Visualizing Webpage Changes Over Time" by Abigail Mabe, Dhruv Patel, Maheedhar Gunnam, Surbhi Shankar, Mat Kelly, Sawood Alam, Michael Nelson, and Michele Weigle; "125 Databases for the Year 2080" by Kai Naumann; "SHARI – An Integration of Tools to Visualize the Story of the Day" by Shawn Jones, Alexander Nwala, Martin Klein, Michele Weigle, and Michael Nelson; "MementoEmbed and Raintale for Web Archive Storytelling" by Shawn Jones, Martin Klein, Michele Weigle, and Michael Nelson; "Improving the Quality of Web Harvests Using Web Curator Tool" by Ben O'Brien, Andrea Goethals, Jeffrey van der Hoeven, Hanna Koppelaar, Trienka Rohrbach, Steve Knight, Frank Lee, and Charmaine Fajardo DA - 2020/08/05/ PY - 2020 DP - vtechworks.lib.vt.edu LA - en UR - https://vtechworks.lib.vt.edu/handle/10919/99569 Y2 - 2021/07/15/07:51:28 L1 - https://vtechworks.lib.vt.edu/bitstream/handle/10919/99569/WADL2020.pdf?sequence=1&isAllowed=y L1 - https://vtechworks.lib.vt.edu/bitstream/10919/99569/1/WADL2020.pdf L2 - https://vtechworks.lib.vt.edu/handle/10919/99569 ER - TY - CHAP TI - Quantitative Approaches to the Danish Web Archive AU - Nielsen, Janne T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - Large-scale historical studies are important for a comprehensive understanding of the historical development of the Web, but such large-scale studies pose certain methodological challenges and require a different level of access to web archives from what is often offered, that is, access through the Wayback Machine. The Danish national web archive Netarkivet and the HPC facility DeiC National Cultural Heritage Cluster, both located at the Royal Danish Library in Denmark, have in recent years opened up new ways to access and process materials from the archive, which allow for quantitative analysis of the archived Danish Web.
This chapter includes examples of large-scale studies of different aspects of the Danish Web as it has been archived in Netarkivet. It describes several approaches to creating and analysing large corpora using different types of archived sources for different purposes, such as metadata from crawl.logs to undertake measurement studies and text analyses, hyperlinks to conduct link analyses and HTML documents to search for specific elements in the source code, e.g. identifying requests to third parties to study web tracking. This chapter discusses the methodological challenges related to the use of the archived Web as an object of study in quantitative research and the challenges and benefits of applying computational methods in historical web studies. The archived Danish Web will serve as a case study, but the suggested approaches could be replicated using other web archives, and the use cases aim to highlight how access to different kinds of archived web sources and the use of computational methods allow for new types of studies of the past Web. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 165 EP - 179 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_13 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - The Problem of Web Ephemera AU - Major, Daniela T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - This chapter introduces the problem of web-data transience and its impact on modern societies. 
CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 5 EP - 10 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_1 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Digital Archaeology in the Web of Links: Reconstructing a Late-1990s Web Sphere AU - Webster, Peter T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - One unit of analysis within the archived Web is the “web sphere”, a body of material from different hosts that is related in some meaningful sense. This chapter outlines a method of reconstructing such a web sphere from the late 1990s, that of conservative British Christians as they interacted with each other and with others in the USA in relation to issues of morality, domestic and international politics, law and the prophetic interpretation of world events. Using an iterative method of interrogation of the graph of links for the archived UK Web, it shows the potential for the reconstruction of what I describe as a “soft” web sphere from what is in effect an archive with a finding aid with only classmarks and no descriptions. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 155 EP - 164 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 ST - Digital Archaeology in the Web of Links UR - https://doi.org/10.1007/978-3-030-63291-5_12 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - National Web Archiving in Australia: Representing the Comprehensive AU - Koerbin, Paul T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - National libraries have been at the forefront of web archiving since the activity commenced in the mid-1990s. This effort is built upon and sustained by their long-term strategic focus, curatorial experience and mandate to collect a nation’s documentary heritage. 
Nevertheless, their specific legal remit, resources and strategic priorities will affect the objectives and the outcomes of national web archiving programmes. The National Library of Australia’s web archiving programme, being among the earliest established and longest sustained activities, provides a case study on the origin and building of a practical approach to comprehensive national collecting and access. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 23 EP - 32 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 ST - National Web Archiving in Australia UR - https://doi.org/10.1007/978-3-030-63291-5_3 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Full-Text and URL Search Over Web Archives AU - Costa, Miguel T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - Web archives are a historically valuable source of information. In some respects, web archives are the only record of the evolution of human society in the last two decades. They preserve a mix of personal and collective memories, the importance of which tends to grow as they age. However, the value of web archives depends on their users being able to search and access the information they require in efficient and effective ways. Without the possibility of exploring and exploiting the archived contents, web archives are useless. Web archive access functionalities range from basic browsing to advanced search and analytical services, accessed through user-friendly interfaces. Full-text and URL search have become the predominant and preferred forms of information discovery in web archives, fulfilling user needs and supporting search APIs that feed complex applications. Both full-text and URL search are based on the technology developed for modern web search engines, since the Web is the main resource targeted by both systems. 
However, while web search engines enable searching over the most recent web snapshot, web archives enable searching over multiple snapshots from the past. This means that web archives have to deal with a temporal dimension that is the cause of new challenges and opportunities, discussed throughout this chapter. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 71 EP - 84 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_7 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Archiving Social Media: The Case of Twitter AU - Pehlivan, Zeynep AU - Thièvre, Jérôme AU - Drugeon, Thomas T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - Around the world, billions of people use social media like Twitter and Facebook every day, to find, discuss and share information. Social media, which has transformed people from content readers to publishers, is not only an important data source for researchers in social science but also a “must archive” object for web archivists for future generations. In recent years, various communities have discussed the need to archive social media and have debated the issues related to its archiving. There are different ways of archiving social media data, including using traditional web crawlers and application programming interfaces (APIs) or purchasing from official company firehoses. It is important to note that the first two methods bring some issues related to capturing the dynamic and volatile nature of social media, in addition to the severe restrictions of APIs. These issues have an impact on the completeness of collections and in some cases return only a sample of the whole. In this chapter, we present these different methods and discuss the challenges in detail, using Twitter as a case study to better understand social media archiving and its challenges, from gathering data to long-term preservation. 
CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 43 EP - 56 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 ST - Archiving Social Media UR - https://doi.org/10.1007/978-3-030-63291-5_5 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Creating Event-Centric Collections from Web Archives AU - Demidova, Elena AU - Risse, Thomas T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - Web archives are an essential information source for research on historical events. However, the large scale and heterogeneity of web archives make it difficult for researchers to access relevant event-specific materials. In this chapter, we discuss methods for creating event-centric collections from large-scale web archives. These methods are manifold and may require manual curation, adopt search or deploy focused crawling. In this chapter, we focus on the crawl-based methods that identify relevant documents in and across web archives and include link networks as context in the resulting collections. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 57 EP - 67 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_6 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - A Holistic View on Web Archives AU - Holzmann, Helge AU - Nejdl, Wolfgang T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - In order to address the requirements of different user groups and use cases of web archives, we have identified three views to access and explore web archives: user-, data- and graph-centric. The user-centric view is the natural way to look at the archived pages in a browser, just like the live web is consumed. By zooming out from there and looking at whole collections in a web archive, data processing methods can enable analysis at scale. 
In this data-centric view, the web and its dynamics as well as the contents of archived pages can be looked at from two angles: (1) by retrospectively analysing crawl metadata with respect to the size, age and growth of the web and (2) by processing archival collections to build research corpora from web archives. Finally, the third perspective is what we call the graph-centric view, which considers websites, pages or extracted facts as nodes in a graph. Links among pages or the extracted information are represented by edges in the graph. This structural perspective conveys an overview of the holdings and connections among contained resources and information. Only all three views together provide the holistic view that is required to effectively work with web archives. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 85 EP - 99 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_8 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Linking Twitter Archives with Television Archives AU - Pehlivan, Zeynep T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - Social media data has already established itself as an important data source for researchers working in a number of different domains. It has also attracted the attention of archiving institutions, many of which have already extended their crawling processes to capture at least some forms of social media data. However, far too little attention has been paid to providing access to this data, which has generally been collected using application programming interfaces (APIs). There is a growing need to contextualize the data gathered from APIs, so that researchers can make informed decisions about how to analyse it, and to develop efficient ways of providing access to it. 
This chapter will discuss one possible means of providing enhanced access: a new interface developed at the Institut national de l’audiovisuel (INA) that links Twitter and television archives to recreate the phenomenon of the “second screen”, or more precisely the experience of “social television”. The phrase “second screen” describes the increasingly ubiquitous activity of using a second computing device (commonly a mobile phone or tablet) while watching television. If the second device is used to comment on, like or retweet television-related content via social media, the result is so-called “social television”. The analysis of this activity, and this data, offers a promising new avenue of research for scholars, especially those based in the digital humanities. To the best of our knowledge, the work that will be discussed here is the first attempt at considering how best to recreate the experience of “social television” using archived data. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 127 EP - 139 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_10 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Web Archives Preserve Our Digital Collective Memory AU - Major, Daniela AU - Gomes, Daniel T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - This chapter discusses the importance of web archiving, briefly presents its history from its beginning with the Internet Archive in 1996 and examines the challenges of archiving certain types of online data.
CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 11 EP - 19 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_2 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Exploring Online Diasporas: London’s French and Latin American Communities in the UK Web Archive AU - Huc-Hepher, Saskia AU - Wells, Naomi T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - The aim of the UK Web Archive to collect and preserve the entire UK web domain ensures that it is able to reflect the diversity of voices and communities present on the open Web, including migrant communities who sustain a presence across digital and physical environments. At the same time, patterns of wider social and political exclusion, as well as the use of languages other than English, mean these communities’ web presence is often overlooked in more generic and Anglophone web archiving and (re)searching practices. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 189 EP - 201 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 ST - Exploring Online Diasporas UR - https://doi.org/10.1007/978-3-030-63291-5_15 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Image Analytics in Web Archives AU - Müller-Budack, Eric AU - Pustu-Iren, Kader AU - Diering, Sebastian AU - Springstein, Matthias AU - Ewerth, Ralph T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - The multimedia content published on the World Wide Web is constantly growing and contains valuable information in various domains. The Internet Archive initiative has gathered billions of time-versioned web pages since the mid-nineties, but unfortunately, they are rarely provided with appropriate metadata. 
This lack of structured data limits the exploration of the archives, and automated solutions are required to enable semantic search. While many approaches exploit the textual content of news in the Internet Archive to detect named entities and their relations, visual information is generally disregarded. In this chapter, we present an approach that leverages deep learning techniques for the identification of public personalities in the images of news articles stored in the Internet Archive. In addition, we elaborate on how this approach can be extended to enable detection of other entity types such as locations or events. The approach complements named entity recognition and linking tools for text and allows researchers and analysts to track the media coverage and relations of persons more precisely. We have analysed more than one million images from news articles in the Internet Archive and demonstrated the feasibility of the approach with two use cases in different domains: politics and entertainment. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 141 EP - 151 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_11 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Critical Web Archive Research AU - Ben-David, Anat T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - Following the familiar distinction between software and hardware, this chapter argues that web archives deserve to be treated as a third category—memoryware: specific forms of preservation techniques which involve both software and hardware, but also crawlers, bots, curators, and users. While historically the term memoryware refers to the art of cementing together bits and pieces of sentimental objects to commemorate loved ones, understanding web archives as a complex socio-technical memoryware moves beyond their perception as bits and pieces of the live Web. 
Instead, understanding web archives as memoryware hints at the premise of the web’s exceptionalism in media and communication history and calls for revisiting some of the concepts and best practices in web archiving and web archive research that have consolidated over the years. The chapter, therefore, presents new challenges for web archive research by turning a critical eye on web archiving itself and on the specific types of histories that are constructed with web archives. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 181 EP - 188 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_14 Y2 - 2021/07/15/07:52:26 ER - TY - CHAP TI - Interoperability for Accessing Versions of Web Resources with the Memento Protocol AU - Jones, Shawn M. AU - Klein, Martin AU - Sompel, Herbert Van de AU - Nelson, Michael L. AU - Weigle, Michele C. T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - The Internet Archive pioneered web archiving and remains the largest publicly accessible web archive hosting archived copies of web pages (Mementos) going back as far as early 1996. Its holdings have grown steadily since, and it hosts more than 881 billion URIs as of September 2019. However, the landscape of web archiving has changed significantly over the last two decades. Today we can freely access Mementos from more than 20 web archives around the world, operated by for-profit and nonprofit organisations, national libraries and academic institutions, as well as individuals. The resulting diversity improves the odds of the survival of archived records but also requires technical standards to ensure interoperability between archival systems. To date, the Memento Protocol and the WARC file format are the main enablers of interoperability between web archives. 
We describe a variety of tools and services that leverage the broad adoption of the Memento Protocol and discuss a selection of research efforts that would likely not have been possible without these interoperability standards. In addition, we outline examples of technical specifications that build on the ability of machines to access resource versions on the Web in an automatic, standardised and interoperable manner. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 101 EP - 126 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_9 Y2 - 2021/07/15/07:52:26 ER - TY - JOUR TI - Archiving Catholic Faith on the Web During the COVID-19 Pandemic AU - Harris, Kayla AU - Shreffler, Stephanie DA - 2021/// PY - 2021 DP - Zotero VL - 91 IS - 3 SP - 7 LA - en L1 - https://ecommons.udayton.edu/cgi/viewcontent.cgi?article=1057&context=imri_faculty_publications ER - TY - JOUR TI - Review Essay: What We Talk about When We Talk about Archiving the Web AU - Summers, Ed T2 - The American Archivist AB - What is the Web? Is it the collection of standards, such as hypertext markup language (HTML), hypertext transfer protocol (HTTP), and uniform resource identifiers (URI) that have evolved for the past three decades? Is it various pieces of software such as servers in the cloud, the browser on a laptop, and the apps on a smartphone? Is it all the types of content (text, video, audio) that get linked together and made interactive with JavaScript? Is it the businesses, governments, organizations, and collectives that render behavioral norms, economies of scale, and laws that shape the global distribution of information? Of course, the answer to all these questions (and more) is yes. So, what could it possibly mean to archive the Web? 
DA - 2020/03/01/ PY - 2020 DO - 10.17723/0360-9081-83.1.167 DP - Silverchair VL - 83 IS - 1 SP - 167 EP - 175 J2 - The American Archivist SN - 0360-9081 ST - Review Essay UR - https://doi.org/10.17723/0360-9081-83.1.167 Y2 - 2021/07/15/08:00:53 L1 - https://meridian.allenpress.com/american-archivist/article-pdf/83/1/167/2553483/0360-9081-83_1_167.pdf L2 - https://meridian.allenpress.com/american-archivist/article/83/1/167/441152/Review-Essay-What-We-Talk-about-When-We-Talk-about ER - TY - JOUR TI - The Web Archives Long View AU - Schafer, Valerie AU - Winters, Jane DA - 2021/05/26/ PY - 2021 DP - orbilu.uni.lu LA - en UR - https://orbilu.uni.lu/handle/10993/47287 Y2 - 2021/07/15/08:01:37 L2 - https://orbilu.uni.lu/handle/10993/47287 ER - TY - RPRT TI - Selected Papers of #AoIR2020: The 21st Annual Conference of the Association of Internet Researchers AU - Aisdl AB - Since its inception, the Association of Internet Researchers (AoIR) has fostered critical reflection on the ethical and social dimensions of the internet and internet-facilitated communication. These ethical foci are clearly evoked throughout the thematics of the AoIR 2020 conference call, beginning with Power, justice, and inequality in digitally mediated lives; Life, sex, and death vis-à-vis social media; and Political life online. DA - 2020/12/09/ PY - 2020 DP - DOI.org (Crossref) LA - en M3 - preprint PB - Open Science Framework ST - Selected Papers of #AoIR2020 UR - https://osf.io/59tm4 Y2 - 2021/07/15/08:02:00 L1 - https://biblio.ugent.be/publication/8676990/file/8676991 ER - TY - CONF TI - Piloting access to the Belgian web-archive for scientific research: a methodological exploration AU - Mechant, Peter AU - Chambers, Sally AU - Vlassenroot, Eveline AU - Geeraert, Friedel T2 - Engaging with Web Archives: 'Opportunities, Challenges and Potentialities' (#EWAVirtual) AB - The web is fraught with contradiction. 
On the one hand, the web has become a central source of information in everyday life and therefore holds the primary sources of our history created by a large variety of people (Milligan, 2016; Winters, 2017). Yet, much less importance is attached to its preservation, meaning that potentially interesting sources for future (humanities) research are lost. Web archiving therefore is a direct result of the computational turn and has a role to play in knowledge production and dissemination as demonstrated by a number of publications (e.g. Brügger & Schroeder, 2017) and research initiatives related to the research use of web archives (e.g. https://resaw.eu/). However, conducting research, and answering research questions based on web archives - in short, ‘using web archives as a data resource for digital scholars’ (Vlassenroot et al., 2019) - demonstrates that this so-called ‘computational turn’ in humanities and social sciences (i.e. the increased incorporation of advanced computational research methods and large datasets into disciplines which have traditionally dealt with considerably more limited collections of evidence) indeed requires new skills and new software. In December 2016, a pilot web-archiving project called PROMISE (PReserving Online Multiple Information: towards a Belgian StratEgy) was funded. The aim of the project was to (i) identify current best practices in web-archiving and apply them to the Belgian context, (ii) pilot Belgian web-archiving, (iii) pilot access (and use) of the pilot Belgian web archive for scientific research, and (iv) make recommendations for a sustainable web-archiving service for Belgium. Now that the project is moving towards its final stages, the project team is focusing on the third objective, namely how to pilot access to the Belgian web archive for scientific research.
The aim of this presentation is to discuss how the PROMISE team approached piloting access to the Belgian web-archive for scientific research, including: a) reviewing how existing web-archives provide access to their collections for research, b) assessing the needs of researchers based on a range of initiatives focussing on research-use of web-archives (e.g. RESAW, BUDDAH, WARCnet, IIPC Research Working Group, etc.) and c) exploring how the five personas created as part of the French National Library’s Corpus project (Moiraghi, 2018) could help us to explore how different types of academic researchers might use web archives in their research. Finally, we will introduce the emerging Digital Research Lab at the Royal Library of Belgium (KBR) as part of a long-term collaboration with the Ghent Centre for Digital Humanities (GhentCDH), which aims to facilitate data-level access to KBR’s digitised and born-digital collections and could potentially provide the solution for offering research access to the Belgian web-archive.
C3 - Engaging with Web Archives : Opportunities, Challenges and Potentialities, #EWAVirtual 2020 DA - 2020/// PY - 2020 DP - biblio.ugent.be SP - 27 EP - 29 LA - eng PB - Maynooth University Arts and Humanities Institute ST - Piloting access to the Belgian web-archive for scientific research UR - http://hdl.handle.net/1854/LU-8692421 Y2 - 2021/07/15/08:02:26 L1 - https://biblio.ugent.be/publication/8692421/file/8692423.pdf L2 - https://biblio.ugent.be/publication/8692421 ER - TY - BOOK TI - The Past Web AU - Gomes, Daniel AU - Demidova, Elena AU - Winters, Jane AU - Risse, Thomas DA - 2021/// PY - 2021 DP - Google Scholar ET - 1 SP - 297 LA - en PB - Springer SN - 978-3-030-63291-5 UR - https://www.springer.com/gp/book/9783030632908 Y2 - 2021/08/06/ L2 - https://link.springer.com/book/10.1007%2F978-3-030-63291-5 ER - TY - CHAP TI - The Past Web: A Look into the Future AU - Masanès, Julien AU - Major, Daniela AU - Gomes, Daniel T2 - The Past Web DA - 2021/// PY - 2021 DP - Google Scholar SP - 285 EP - 291 PB - Springer ST - The Past Web L2 - https://link.springer.com/chapter/10.1007/978-3-030-63291-5_22 ER - TY - JOUR TI - Global trends in library web-archives AU - Redkina, Natalya S. 
T2 - Scientific and Technical Libraries DA - 2021/// PY - 2021 DP - Google Scholar IS - 1 SP - 100 ER - TY - CONF TI - Requirements and desiderata for the scholarly use of web archives AU - Vlassenroot, Eveline AU - Chambers, Sally AU - Geeraert, Friedel AU - Mechant, Peter C3 - Association of Internet Researchers DA - 2020/// PY - 2020 DP - Google Scholar ER - TY - JOUR TI - Quality Matters: A New Approach for Detecting Quality Problems in Web Archives AU - Ayala, Brenda Reyes AU - McDevitt, Jennifer AU - Sun, James AU - Liu, Xiaohui T2 - Proceedings of the Annual Conference of CAIS / Actes du congrès annuel de l'ACSI DA - 2020/11/08/ PY - 2020 DO - 10.29173/cais1145 DP - journals.library.ualberta.ca LA - en SN - 2562-7589 ST - Quality Matters UR - https://journals.library.ualberta.ca/ojs.cais-acsi.ca/index.php/cais-asci/article/view/1145 Y2 - 2021/07/15/08:05:23 L1 - https://journals.library.ualberta.ca/ojs.cais-acsi.ca/index.php/cais-asci/article/download/1145/1004 ER - TY - JOUR TI - When expectations meet reality: common misconceptions about web archives and challenges for scholars AU - Ayala, Brenda Reyes T2 - International Journal of Digital Humanities AB - As the study of digital history, politics, and culture emerges as an academic discipline, web archives will play a valuable role as sources of information. Those wishing to engage with web archives will need both specific technical skills and a high-level understanding of how the web works. This paper examines the nature and type of misconceptions that web archivists form when they create and utilise web archives. In order to carry out this research, the author qualitatively analyzed support tickets submitted by web archivists using the Internet Archive’s Archive-It (AIT), the most popular web archiving service. The tickets comprised 2544 interactions between web archivists and AIT support specialists. 
This paper describes the expectations AIT users bring to web archives, and the differences between their expectations and the realities of the web archiving process. It identifies the most prominent misconceptions AIT users have about both web archives and the web itself, analyses the challenges these misconceptions can pose for researchers, and recommends ways in which these can be addressed. DA - 2021/06/12/ PY - 2021 DO - 10.1007/s42803-021-00034-3 DP - Springer Link J2 - Int J Digit Humanities LA - en SN - 2524-7840 ST - When expectations meet reality UR - https://doi.org/10.1007/s42803-021-00034-3 Y2 - 2021/07/15/08:05:54 ER - TY - JOUR TI - The values of web archives AU - Schafer, Valérie AU - Winters, Jane T2 - International Journal of Digital Humanities AB - This article considers how the development, promotion and adoption of a set of core values for web archives, linked to principles of “good governance”, will help them to tackle the challenges of sustainability, accountability and inclusiveness that are central to their long-term societal and cultural worth. It outlines the work that has already been done to address these questions, as web archiving begins to move out of its establishment phase, and then discusses seven key principles of good governance that might be adapted by and embedded within web archives: participation, consensus, accountability, transparency, effectiveness and efficiency, inclusivity and legality. The article concludes with a call to action for researchers and archivists to co-create the core values for web archives that will be required if they are to remain a vital part of our cultural heritage infrastructure. 
DA - 2021/06/10/ PY - 2021 DO - 10.1007/s42803-021-00037-0 DP - Springer Link J2 - Int J Digit Humanities LA - en SN - 2524-7840 UR - https://doi.org/10.1007/s42803-021-00037-0 Y2 - 2021/07/15/08:06:46 L1 - https://link.springer.com/content/pdf/10.1007%2Fs42803-021-00037-0.pdf ER - TY - JOUR TI - Archival strategies for contemporary collecting in a world of big data: Challenges and opportunities with curating the UK web archive AU - Bingham, Nicola Jayne AU - Byrne, Helena T2 - Big Data & Society AB - In this contribution, we will discuss the opportunities and challenges arising from memory institutions' need to redefine their archival strategies for contemporary collecting in a world of big data. We will reflect on this topic by critically examining the case study of the UK Web Archive, which is made up of the six UK Legal Deposit Libraries: the British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries Oxford, Cambridge University Library and Trinity College Dublin. The UK Web Archive aims to archive, preserve and give access to the UK web space. This is achieved through an annual domain crawl, first undertaken in 2013, in addition to more frequent crawls of key websites and specially curated collections which date back as far as 2005. These collections reflect important aspects of British culture and events that shape society. This commentary will explore a number of questions including: what heritage is captured and what heritage is instead neglected by the UK Web archive? What heritage is created in the form of new data and what are its properties? What are the ethical issues that memory institutions face when developing these web archiving practices? What transformations are required to overcome such challenges and what institutional futures can we envisage? 
DA - 2021/// PY - 2021 DO - 10.1177/2053951721990409 DP - SAGE Journals VL - 8 IS - 1 SP - 2053951721990409 J2 - Big Data & Society LA - en SN - 2053-9517 ST - Archival strategies for contemporary collecting in a world of big data UR - https://doi.org/10.1177/2053951721990409 Y2 - 2021/07/15/08:07:11 L1 - https://journals.sagepub.com/doi/pdf/10.1177/2053951721990409 KW - big data KW - Web archiving KW - legal deposit KW - ethics KW - heritage KW - researcher access ER - TY - CONF TI - Identifying Documents In-Scope of a Collection from Web Archives AU - Patel, Krutarth AU - Caragea, Cornelia AU - Phillips, Mark E. AU - Fox, Nathaniel T. T3 - JCDL '20 AB - Web archive data usually contains high-quality documents that are very useful for creating specialized collections of documents, e.g., scientific digital libraries and repositories of technical reports. In doing so, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection out of the huge number of documents collected by web archiving institutions. In this paper, we explore different learning models and feature representations to determine the best performing ones for identifying the documents of interest from the web archived data. Specifically, we study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents. We focus our evaluation on three datasets that we created from three different Web archives. Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets. 
C1 - New York, NY, USA C3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 DA - 2020/// PY - 2020 DO - 10.1145/3383583.3398540 DP - ACM Digital Library SP - 167 EP - 176 PB - Association for Computing Machinery SN - 978-1-4503-7585-6 UR - https://doi.org/10.1145/3383583.3398540 Y2 - 2021/07/15/ L1 - https://arxiv.org/pdf/2009.00611 L1 - https://dl.acm.org/doi/pdf/10.1145/3383583.3398540 KW - web archiving KW - digital libraries KW - text classification ER - TY - JOUR TI - Go fish: Conceptualising the challenges of engaging national web archives for digital research AU - Ogden, Jessica AU - Maemura, Emily T2 - International Journal of Digital Humanities AB - Our work considers the sociotechnical and organisational constraints of web archiving in order to understand how these factors and contingencies influence research engagement with national web collections. In this article, we compare and contrast our experiences of undertaking web archival research at two national web archives: the UK Web Archive located at the British Library and the Netarchive at the Royal Danish Library. Based on personal interactions with the collections, interviews with library staff and observations of web archiving activities, we invoke three conceptual devices (orientating, auditing and constructing) to describe common research practices and associated challenges in the context of each national web archive. Through this framework we centre the early stages of the research process that are often only given cursory attention in methodological descriptions of web archival research, to discuss the epistemological entanglements of researcher practices, instruments, tools and methods that create the conditions of possibility for new knowledge and scholarship in this space. 
In this analysis, we highlight the significant time and energy required on the part of researchers to begin using national web archives, as well as the value of engaging with the curatorial infrastructure that enables web archiving in practice. Focusing an analysis on these research infrastructures facilitates a discussion of how these web archival interfaces both enable and foreclose on particular forms of researcher engagement with the past Web and in turn contributes to critical ongoing debates surrounding the opportunities and constraints of digital sources, methodologies and claims within the Digital Humanities. DA - 2021/04/27/ PY - 2021 DO - 10.1007/s42803-021-00032-5 DP - Springer Link J2 - Int J Digit Humanities LA - en SN - 2524-7840 ST - ‘Go fish’ UR - https://doi.org/10.1007/s42803-021-00032-5 Y2 - 2021/07/15/08:10:06 L1 - https://link.springer.com/content/pdf/10.1007%2Fs42803-021-00032-5.pdf ER - TY - CONF TI - Correspondence as the Primary Measure of Quality for Web Archives: A Grounded Theory Study AU - Ayala, Brenda Reyes A2 - Hall, Mark A2 - Merčun, Tanja A2 - Risse, Thomas A2 - Duchateau, Fabien T3 - Lecture Notes in Computer Science AB - Creating an archived website that is as close as possible to the original, live website remains one of the most difficult challenges in the field of web archiving. Failing to adequately capture a website might mean an incomplete historical record or, worse, no evidence that the site ever even existed. This paper presents a grounded theory of quality for web archives created using data from web archivists. In order to achieve this, I analysed support tickets submitted by clients of the Internet Archive’s Archive-It (AIT), a subscription-based web archiving service that helps organisations build and manage their own web archives. Overall, 305 tickets were analysed, comprising 2544 interactions. 
The resulting theory is comprised of three dimensions of quality in a web archive: correspondence, relevance, and archivability. The dimension of correspondence, defined as the degree of similarity or resemblance between the original website and the archived website, is the most important facet of quality in web archives, and it is the main focus of this work. This paper’s contribution is that it presents the first theory created specifically for web archives and lays the groundwork for future theoretical developments in the field. Furthermore, the theory is human-centred and grounded in how users and creators of web archives perceive their quality. By clarifying the notion of quality in a web archive, this research will be of benefit to web archivists and cultural heritage institutions. C1 - Cham C3 - Digital Libraries for Open Knowledge DA - 2020/// PY - 2020 DO - 10.1007/978-3-030-54956-5_6 DP - Springer Link SP - 73 EP - 86 LA - en PB - Springer International Publishing SN - 978-3-030-54956-5 ST - Correspondence as the Primary Measure of Quality for Web Archives L1 - https://link.springer.com/content/pdf/10.1007%2F978-3-030-54956-5_6.pdf KW - Web archiving KW - Grounded theory KW - Information quality KW - Quality Assurance ER - TY - JOUR TI - Web Archiving in the Public Interest from a Data Protection Perspective AU - Michel, Alejandra T2 - Deep diving into data protection: 1979-2019 : celebrating 40 years of research on privacy data protection at the CRIDS DA - 2021/// PY - 2021 DP - researchportal.unamur.be SP - 181 EP - 200 LA - English UR - https://researchportal.unamur.be/en/publications/web-archiving-in-the-public-interest-from-a-data-protection-persp Y2 - 2021/07/15/08:12:21 L2 - https://researchportal.unamur.be/en/publications/web-archiving-in-the-public-interest-from-a-data-protection-persp ER - TY - ELEC TI - Correspondence as the Primary Measure of Quality for Web Archives: A Grounded Theory Study AU - Reyes Ayala, Brenda T2 - ERA AB - Creating an 
archived website that is as close as possible to the original, live website remains one of the most difficult challenges in... DA - 2020/06/02/ PY - 2020 LA - en ST - Correspondence as the Primary Measure of Quality for Web Archives UR - https://era.library.ualberta.ca/items/b45b9bf6-424d-4052-85a2-3517c5512cd8 Y2 - 2021/07/15/08:13:28 L1 - https://era.library.ualberta.ca/items/b45b9bf6-424d-4052-85a2-3517c5512cd8/view/5bc8acb3-8a42-4620-b069-7c5c254d4a4c/breyes_tpdl2020_preprint.pdf L2 - https://era.library.ualberta.ca/items/b45b9bf6-424d-4052-85a2-3517c5512cd8 ER - TY - JOUR TI - Open Challenges for the Management and Preservation of Evolving Data on the Web AU - Gleim, Lars AU - Decker, Stefan AB - As the volume, variety, and velocity of data published on the Web continue to increase, the management, governance and preservation of these data play an increasingly important role. Data-driven decision making and algorithmic control systems rely on the persistent availability of critical information. However, to date, the free sharing, reuse and interoperability of data are hindered by a number of fundamental open challenges for the management and preservation of evolving data on the Web. In this work, we provide an overview of open challenges and recent efforts to address them. We then propose a data persistence layer for data management and preservation, paving the way for increased interoperability and compatibility. DA - 2020/// PY - 2020 DP - Zotero SP - 7 LA - en UR - http://ceur-ws.org/Vol-2821/paper9.pdf L1 - http://ceur-ws.org/Vol-2821/paper9.pdf ER - TY - CONF TI - Archiving Interactive Narratives at the British Library AU - Clark, Lynda AU - Rossi, Giulia Carla AU - Wisdom, Stella A2 - Bosser, Anne-Gwenn A2 - Millard, David E. A2 - Hargood, Charlie T3 - Lecture Notes in Computer Science AB - This paper describes the creation of the Interactive Narratives collection in the UK Web Archive, as part of the UK Legal Deposit Libraries Emerging Formats Project. 
The aim of the project is to identify, collect and preserve complex digital publications that are in scope for collection under UK Non-Print Legal Deposit Regulations. This article traces the process of building the Interactive Narratives collection, analysing the different tools and methods used and placing the collection within the wider context of Emerging Formats work and engagement activities at the British Library. C1 - Cham C3 - Interactive Storytelling DA - 2020/// PY - 2020 DO - 10.1007/978-3-030-62516-0_27 DP - Springer Link SP - 300 EP - 313 LA - en PB - Springer International Publishing SN - 978-3-030-62516-0 L1 - https://discovery.dundee.ac.uk/ws/files/52607687/Archiving_Interactive_Narratives_at_the_British_Library.pdf L1 - https://link.springer.com/content/pdf/10.1007%2F978-3-030-62516-0_27.pdf KW - Web archiving KW - Digital preservation KW - Digital storytelling KW - Emerging Formats KW - Interactive Narratives collection KW - New media collection management ER - TY - BOOK TI - Electronic Legal Deposit: Shaping the library collections of the future AU - Gooding, Paul AU - Terras, Melissa AB - Legal deposit libraries, the national and academic institutions who systematically preserve our written cultural record, have recently been mandated with expanding their collection practices to include digitised and born-digital materials. The regulations that govern electronic legal deposit often also prescribe how these materials can be accessed. Although a growing international activity, there has been little consideration of the impact of e-legal deposit on the 21st Century library, or on its present or future users. This edited collection is a timely opportunity to bring together international authorities who are placed to explore the social, institutional and user impacts of e-legal deposit. 
It uniquely provides a thorough overview of this worldwide issue at an important juncture in the history of library collections in our changing information landscape, drawing on evidence gathered from real-world case studies produced in collaboration with leading libraries, researchers and practitioners (Biblioteca Nacional de México, Bodleian Libraries, British Library, National Archives of Zimbabwe, National Library of Scotland, National Library of Sweden). Chapters consider the viewpoint of a variety of stakeholders, including library users, researchers, and publishers, and provide overviews of the complex digital preservation and access issues that surround e-legal deposit materials, such as web archives and interactive media. The book will be essential reading for practitioners and researchers in national and research libraries, those developing digital library infrastructures, and potential users of these collections, but also those interested in the long-term implications of how our digital collections are conceived, regulated and used. Electronic legal deposit is shaping our digital library collections, but also their future use, and this volume provides a rigorous account of its implementation and impact. DA - 2020/10/02/ PY - 2020 DP - Google Books SP - 272 LA - en PB - Facet Publishing SN - 978-1-78330-377-9 ST - Electronic Legal Deposit L2 - https://books.google.hu/books?id=7HoUEAAAQBAJ KW - Language Arts & Disciplines / Library & Information Science / Digital & Online Resources ER - TY - CONF TI - The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle AU - Wang, Xinyue AU - Xie, Zhiwu T3 - JCDL '20 AB - The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest to reuse these archives as big data sources for statistical and analytical research, the speed to turn these data into insights becomes critical. 
In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gains of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement for Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats. C1 - New York, NY, USA C3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 DA - 2020/// PY - 2020 DO - 10.1145/3383583.3398542 DP - ACM Digital Library SP - 177 EP - 186 PB - Association for Computing Machinery SN - 978-1-4503-7585-6 UR - https://doi.org/10.1145/3383583.3398542 Y2 - 2021/07/15/ L1 - https://dl.acm.org/doi/pdf/10.1145/3383583.3398542 L1 - https://vtechworks.lib.vt.edu/bitstream/10919/98565/1/fp210.pdf KW - web archiving KW - big data analysis KW - file format KW - storage management ER - TY - ELEC TI - Appraisal Talk in Web Archives PY - 2021 UR - https://muse.jhu.edu/article/755769/summary Y2 - 2021/07/15/08:17:22 L2 - https://muse.jhu.edu/article/755769/summary ER - TY - JOUR TI - Everything on the Internet can be saved: Archive Team and the death/resurrection of Tumblr NSFW AU - Ogden, Jessica T2 - Internet Histories DA - 2020/10/03/ PY - 2020 DP - research-information.bris.ac.uk LA - English SN - 2470-1483 ST - ‘Everything on the Internet can be saved’ UR - https://research-information.bris.ac.uk/en/publications/everything-on-the-internet-can-be-saved-archive-team-and-the-deat Y2 - 2021/07/15/08:17:54 L2 - https://research-information.bris.ac.uk/en/publications/everything-on-the-internet-can-be-saved-archive-team-and-the-deat ER - TY - JOUR TI - 
Increasing Access to Web Archives: Archive-It and the Discovery Layer AU - Beis, Christina A AU - Harris, Kayla AU - Shreffler, Stephanie T2 - MAC Newsletter DA - 2020/// PY - 2020 DP - Zotero VL - 47 IS - 4 LA - en L1 - https://ecommons.udayton.edu/cgi/viewcontent.cgi?article=1051&context=imri_faculty_publications ER - TY - JOUR TI - History in the Age of Abundance? How the Web is Transforming Historical Research. Ian Milligan. AU - Dillon, Lisa T2 - Canadian Historical Review DA - 2021/03/01/ PY - 2021 DO - 10.3138/chr.102.1.br16 DP - utpjournals.press (Atypon) VL - 102 IS - 1 SP - 202 EP - 204 SN - 0008-3755 ST - History in the Age of Abundance? UR - https://www.utpjournals.press/doi/abs/10.3138/chr.102.1.br16 Y2 - 2021/07/15/08:19:26 ER - TY - CONF TI - The ELTE.DH Pilot Corpus – Creating a Handcrafted Gigaword Web Corpus with Metadata AU - Indig, Balázs AU - Knap, Árpád AU - Sárközi-Lindner, Zsófia AU - Timári, Mária AU - Palkó, Gábor AB - In this article, we present the method we used to create a middle-sized corpus using targeted web crawling. Our corpus contains news portal articles along with their metadata, that can be useful for diverse audiences, ranging from digital humanists to NLP users. The method presented in this paper applies rule-based components that allow the curation of the text and the metadata content. The curated data can thereon serve as a reference for various tasks and measurements. We designed our workflow to encourage modification and customisation. Our concept can also be applied to other genres of portals by using the discovered patterns in the architecture of the portals. We found that for a systematic creation or extension of a similar corpus, our method provides superior accuracy and ease of use compared to The Wayback Machine, while requiring minimal manpower and computational resources. Reproducing the corpus is possible if changes are introduced to the text-extraction process. 
The standard TEI format and Schema.org-encoded metadata are used for the output format, but we stress that placing the corpus in a digital repository system is recommended in order to be able to define semantic relations between the segments and to add rich annotation. C1 - Marseille, France C3 - Proceedings of the 12th Web as Corpus Workshop DA - 2020/05// PY - 2020 DP - ACLWeb SP - 33 EP - 41 LA - English PB - European Language Resources Association SN - 979-10-95546-68-9 UR - https://aclanthology.org/2020.wac-1.5 Y2 - 2021/07/15/08:20:56 L1 - https://aclanthology.org/2020.wac-1.5.pdf ER - TY - JOUR TI - Supporting Sustainable Digital Humanities Projects: Managing the Lifecycle of Student-Created Web Content from Inception to Archiving AU - Walton, Rachel AU - Sugar, Amy T2 - Digital Initiatives Symposium DA - 2021/04/29/ PY - 2021 ST - Supporting Sustainable Digital Humanities Projects UR - https://digital.sandiego.edu/symposium/2021/2021/23 L2 - https://digital.sandiego.edu/symposium/2021/2021/23/ ER - TY - CONF TI - A Semantic Layer Querying Tool AU - Stoffalette João, Renato T3 - WSDM '21 AB - Web archiving is the process of gathering data from the Web, storing it and ensuring the data is preserved in an archive for future explorations. Despite the increasing number of web archives, the absence of meaningful exploration methods remains a major hurdle in the way of turning them into a useful information source. With the creation of profiles describing metadata information about the archived documents it is possible to offer a more exploitable environment that goes beyond the simple keyword-based search. By exploring the expressive power of the SPARQL language and providing a user-friendly web-based search interface, users can run sophisticated queries searching for documents that meet their information needs. 
C1 - New York, NY, USA C3 - Proceedings of the 14th ACM International Conference on Web Search and Data Mining DA - 2021/03/08/ PY - 2021 DO - 10.1145/3437963.3441710 DP - ACM Digital Library SP - 1101 EP - 1104 PB - Association for Computing Machinery SN - 978-1-4503-8297-7 UR - https://doi.org/10.1145/3437963.3441710 Y2 - 2021/07/15/ L1 - https://dl.acm.org/doi/pdf/10.1145/3437963.3441710 KW - web archives KW - semantic layers KW - information retrieval KW - SPARQL ER - TY - JOUR TI - The weaponization of web archives: Data craft and COVID-19 publics AU - Acker, Amelia AU - Chaiet, Mitch T2 - Harvard Kennedy School Misinformation Review AB - An unprecedented volume of harmful health misinformation linked to the coronavirus pandemic has led to the appearance of misinformation tactics that leverage web archives in order to evade content moderation on social media platforms. Here we present newly identified manipulation techniques designed to maximize the value, longevity, and spread of harmful and non-factual content across social media using provenance information from web archives and social media analytics. After identifying conspiracy content that has been archived by human actors with the Wayback Machine, we report on user patterns of “screensampling,” where images of archived misinformation are spread via social platforms. We argue that archived web resources from the Internet Archive’s Wayback Machine and subsequent screenshots contribute to the COVID-19 “misinfodemic” in platforms. Understanding these manipulation tactics that use sources from web archives reveals something vexing about information practices during pandemics—the desire to access reliable information even after it has been moderated and fact-checked, for some individuals, will give health misinformation and conspiracy theories more traction because it has been labeled as specious content by platforms. 
DA - 2020/09/27/ PY - 2020 DO - 10.37016/mr-2020-41 DP - DOI.org (Crossref) J2 - HKS Misinfo Review LA - en ST - The weaponization of web archives UR - https://misinforeview.hks.harvard.edu/?p=2923 Y2 - 2021/07/15/08:27:57 L1 - https://repositories.lib.utexas.edu/bitstream/handle/2152/83188/The%20weaponization%20of%20web%20archives_%20Data%20craft%20and%20COVID-19%20publics%20_%20HKS%20Misinformation%20Review.pdf?sequence=2 ER - TY - RPRT TI - Book of Abstracts: #EWAVirtual 2020 AU - #EWA Conference Organisers AB - Engaging with Web Archives: ‘Opportunities, Challenges and Potentialities’, (#EWAVirtual), 21-22 September 2020, Maynooth University Arts and Humanities Institute, Co. Kildare, Ireland. The first international Engaging with Web Archives conference sought to: raise awareness for the use of web archives and the archived web for research and education across a broad range of disciplines and professions in the Arts, Humanities, Social Sciences, Political Science, Media Studies, Information Science, Computer Science and more; foster collaborations between web archiving initiatives, researchers, educators and IT professionals; highlight how the development of the internet and the web is intricately linked to the history of the 1990s. This is a Book of Abstracts from the two-day virtual conference, which took place in September 2020 after the original physical conference in April 2020 was postponed due to COVID-19. 
DA - 2020/10/07/ PY - 2020 DP - DOI.org (Datacite) LA - en PB - Zenodo ST - Book of Abstracts UR - https://zenodo.org/record/4058013 Y2 - 2021/07/15/08:28:18 L1 - https://biblio.ugent.be/publication/8692421/file/8692423#page=70 L1 - https://biblio.ugent.be/publication/8692421/file/8692423.pdf#page=66 KW - web archiving KW - web archives KW - archived web KW - research engagement KW - research of web archives KW - research with web archives ER - TY - CONF TI - How to Assess the Exhaustiveness of Longitudinal Web Archives: A Case Study of the German Academic Web AU - Paris, Michael AU - Jäschke, Robert T3 - HT '20 AB - Longitudinal web archives can be a foundation for investigating structural and content-based research questions. One prerequisite is that they contain a faithful representation of the relevant subset of the web. Therefore, an assessment of the authority of a given dataset with respect to a research question should precede the actual investigation. Next to proper creation and curation, this requires measures for estimating the potential of a longitudinal web archive to yield information about the central objects the research question aims to investigate. In particular, content-based research questions often lack the ab-initio confidence about the integrity of the data. In this paper we focus on one specifically important aspect, namely the exhaustiveness of the dataset with respect to the central objects. Therefore, we investigate the recall coverage of researcher names in a longitudinal academic web crawl over a seven year period and the influence of our crawl method on the dataset integrity. Additionally, we propose a method to estimate the amount of missing information as a means to describe the exhaustiveness of the crawl and motivate a use case for the presented corpus. 
C1 - New York, NY, USA C3 - Proceedings of the 31st ACM Conference on Hypertext and Social Media DA - 2020/07/13/ PY - 2020 DO - 10.1145/3372923.3404836 DP - ACM Digital Library SP - 85 EP - 89 PB - Association for Computing Machinery SN - 978-1-4503-7098-1 ST - How to Assess the Exhaustiveness of Longitudinal Web Archives UR - https://doi.org/10.1145/3372923.3404836 Y2 - 2021/07/15/ L1 - https://dl.acm.org/doi/pdf/10.1145/3372923.3404836 KW - web archive KW - dataset KW - exhaustive KW - focused web crawl KW - longitudinal ER - TY - JOUR TI - Labor Gone Digital (DigiFacket)! Experiences from Creating a Web Archive for Swedish Trade Unions AU - Jansson, Jenny AU - Uba, Katrin AU - Karo, Jaanus T2 - Journal of Contemporary Archival Studies DA - 2020/11/20/ PY - 2020 VL - 7 IS - 1 SN - 2380-8845 UR - https://elischolar.library.yale.edu/jcas/vol7/iss1/19 L1 - https://elischolar.library.yale.edu/cgi/viewcontent.cgi?article=1128&context=jcas L2 - https://elischolar.library.yale.edu/jcas/vol7/iss1/19/ ER - TY - SLIDE TI - Extracting Online Publications Embedded in Websites: NDL Initiatives and Challenges T2 - IFLA WLIC 2020 A2 - INOIE, Nobuaki A2 - SHIBATA, Masaki A2 - KUDO, Tetsuro AB - The National Diet Library (NDL) has been operating the Web ARchiving Project (WARP) since 2002 and steadily archiving Japanese websites. However, it is often difficult for users to find e-books, ezines and other online publications embedded in websites, because they are stored as a part of websites and do not have sufficient metadata. 
CY - Dublin DA - 2020/// PY - 2020 LA - en UR - https://origin-www.ifla.org/files/assets/information-technology/Webinars/ifla_professional_units_virtual_events_-_inoie-en.pdf Y2 - 2021/08/06/ L1 - https://origin-www.ifla.org/files/assets/information-technology/Webinars/ifla_professional_units_virtual_events_-_inoie-en.pdf ER - TY - JOUR TI - Fostering Community Engagement through Datathon Events: The Archives Unleashed Experience AU - Fritz, Samantha AU - Milligan, Ian AU - Ruest, Nick AU - Lin, Jimmy AB - This article explores the impact that a series of Archives Unleashed datathon events have had on community engagement both within the web archiving field, and more specifically, on the professional practices of attendees. We present results from surveyed datathon participants, in addition to related evidence from our events, to discuss how our participants saw the datathons as dramatically impacting both their professional practices as well as the broader web archiving community. Drawing on and adapting two leading community engagement models, we combine them to introduce a new understanding of how to build and engage users in an open-source digital humanities project. Our model illustrates both the activities undertaken by our project as well as the related impact they have on the field. The model can be broadly applied to other digital humanities projects seeking to engage their communities. DA - 2021/// PY - 2021 DP - Zotero SP - 14 LA - en UR - http://hdl.handle.net/10315/38257 L1 - https://yorkspace.library.yorku.ca/xmlui/bitstream/handle/10315/38257/DHQ_%20Digital%20Humanities%20Quarterly_%20Fostering%20Community%20Engagement%20through%20Datathon%20Events_%20The%20Archives%20Unleashed%20Experience.pdf?sequence=1&isAllowed=y ER - TY - JOUR TI - Reviewing football history through the UK Web Archive AU - Byrne, Helena T2 - Soccer & Society AB - The UK Web Archive aims to archive, preserve and give access to the UK webspace. 
This aim is achieved through an annual domain crawl, in addition to frequent crawls of selected websites and specially curated collections. These collections reflect important aspects of British culture and events that shape society. Sport, and in particular football, makes up a large section of the UK webspace. For this reason, the UK Web Archive started a curated collection on these subjects in autumn 2017. This article acts as a guide to the collections, using football as an example of the type of research that can be done, rather than a critical analysis of contemporary collection policies or the examples highlighted. It is hoped that this article will increase the use of the UK Web Archive and lead to a further review of how the sports studies field adapts as well as engages with digital resources. DA - 2020/05/18/ PY - 2020 DO - 10.1080/14660970.2020.1751474 DP - Taylor and Francis+NEJM VL - 21 IS - 4 SP - 461 EP - 474 SN - 1466-0970 UR - https://doi.org/10.1080/14660970.2020.1751474 Y2 - 2021/07/15/08:31:06 L2 - https://www.tandfonline.com/doi/abs/10.1080/14660970.2020.1751474 ER - TY - CONF TI - The Neil deGrasse Tyson Problem: Methods for Exploring Base Memes in Web Archives AU - Acker, Amelia AU - C. Loos, Anne AU - Sufrin, Julia T3 - SMSociety'20 AB - In this paper we introduce the concept of the “base meme” for characterizing unique information artifacts that are used to make derivative, new, and related memes. Base memes are antecedents to many versions of derivative memes that are published all across the web. While they can be created in meme template generator websites, their origins and diffusion can be difficult for researchers to verify. Despite the often ephemeral nature of memes that are shared via platforms, they can be fairly reliably found in web archive collections, such as the Internet Archive and the US Library of Congress’ Web Cultures Web Archive. 
In this paper, we first present the existing research on memes and discuss the challenges for researchers who study them (such as identification and language detection). We then describe the importance of web archives to social media research and building robust methods of inquiry for internet history. Using archived data from the Library of Congress’ Meme Generator Archive (N=57,652), we use descriptive analysis to calculate, measure, and describe this important public web archive of memes. Our results show that this collection has a variety of “base memes” that can be grouped with their related derivative memes (which we consider to be their related works). We use language detection software to identify a variety of languages present in the archived dataset of memes. We close by describing why approaching these metrics on “base meme” image macros alongside findings for derivative versions and the multiple languages present in web archives of social media allows researchers to study a diversity of voices, including linguistic diversity, distinctions in humor, and the variety of cultural expressions present in memes. C1 - New York, NY, USA C3 - International Conference on Social Media and Society DA - 2020/07/22/ PY - 2020 DO - 10.1145/3400806.3400836 DP - ACM Digital Library SP - 255 EP - 264 PB - Association for Computing Machinery SN - 978-1-4503-7688-4 ST - The Neil deGrasse Tyson Problem UR - https://doi.org/10.1145/3400806.3400836 Y2 - 2021/07/15/ L1 - https://dl.acm.org/doi/pdf/10.1145/3400806.3400836 KW - web archives KW - Base memes KW - language detection KW - Meme Generator ER - TY - BOOK TI - Going Back in Time to Find What Existed on the Web and How much has been Preserved: How much of Palestinian Web has been Archived? AU - Samar, Thaer AU - Khalilia, Hadi AB - The web is an important resource for publishing and sharing content. The main characteristic of the web is its volatility. Content is added, updated, and deleted all the time. 
Therefore, many national and international institutes have started crawling and archiving the content of the web. The main focus of national institutes is to archive the web related to their country's heritage; for example, the National Library of the Netherlands focuses on archiving websites that are of value to Dutch heritage. However, there are still countries that have not yet taken action to archive their web, which will result in lost content and gaps in knowledge. In this research, we shed light on the Palestinian web; specifically, on how much of the Palestinian web has been archived. First, we create a list of Palestinian hosts that were on the web. For that, we queried the Google index, exploiting the time-range filter to obtain hosts over time. We collected 98 hosts on average per five-year interval from 1990 to 2019. We also obtained 188 Palestinian hosts from the DMOZ directory. Second, we investigate the coverage of the collected hosts in the Internet Archive and Common Crawl. We found that the coverage of the Google hosts in the Internet Archive ranges from 0% to 89% from the oldest to the newest time granularity, while the coverage of the DMOZ hosts was 96%. The coverage of the Google hosts in Common Crawl ranged from 57.1% to 74.3%, while the coverage of the DMOZ hosts in Common Crawl averaged 25% across all crawls. We also found that even when a host is covered in the Internet Archive and Common Crawl, its lifespan and number of archived versions are low. 
DA - 2021/00/09/ PY - 2021 DP - ResearchGate ST - Going Back in Time to Find What Existed on the Web and How much has been Preserved L4 - https://www.researchgate.net/profile/Hadi-Khalilia/publication/348673290_Going_Back_in_Time_to_Find_What_Existed_on_the_Web_and_How_much_has_been_Preserved_How_much_of_Palestinian_Web_has_been_Archived/links/6034a6c84585158939c27f66/Going-Back-in-Time-to-Find-What-Existed-on-the-Web-and-How-much-has-been-Preserved-How-much-of-Palestinian-Web-has-been-Archived.pdf ER - TY - JOUR TI - Exploring the 20-year evolution of a research community: web-archives as essential sources for historical research AU - Brügger, Niels AU - Schafer, Valerie AU - Geeraert, Friedel AU - Isbergue, Nadège AU - Chambers, Sally T2 - Cahiers de la documentation DA - 2020/07// PY - 2020 DP - orbilu.uni.lu VL - 2 LA - en SN - 0007-9804 ST - Exploring the 20-year evolution of a research community UR - https://orbilu.uni.lu/handle/10993/43903 Y2 - 2021/07/15/08:39:51 L1 - https://biblio.ugent.be/publication/8671508/file/8671510 L2 - https://orbilu.uni.lu/handle/10993/43903 ER - TY - JOUR TI - Community History in Minnesota during a Pandemic: What Comes Next? AU - Smith, Adam AU - Mixon, Daardi Sizemore AU - Desens, Rebecca Ebnet AU - Jacobs, Jenna T2 - MAC Annual Meeting Presentations AB -

Three Minnesota cultural heritage organizations developed distinctly different community history projects to document the COVID-19 Pandemic. Anoka County Historical Society distributed monthly surveys asking questions relevant to the community at the time while encouraging the public to submit documentation for the archives. Hennepin County Library rapidly expanded its nascent web archiving program to capture websites of Minneapolis and suburban community organizations affected by and responding to the pandemic. Minnesota State University, Mankato developed a community history project that incorporated the international student experience to explore how our students and their families responded to the pandemic throughout the summer.

This presentation will discuss the logistics of how the organizations planned and conducted their community history projects, and the next steps for those collections. Presenters will discuss processing primarily born-digital materials and making the collections available to researchers while navigating privacy issues to protect contributors. Each of these projects has spawned innovative thinking and contributed to new directions and partnerships for the organizations, including an emphasis on social justice initiatives.

DA - 2021/05/14/03:00 PY - 2021 DP - www.iastatedigitalpress.com VL - 2021 IS - 1 LA - eng ST - Community History in Minnesota during a Pandemic UR - https://www.iastatedigitalpress.com/macmeetings/article/id/12571/ Y2 - 2021/07/15/08:41:25 L2 - https://www.iastatedigitalpress.com/macmeetings/article/id/12571/print/ ER - TY - JOUR TI - Bit Rosie: A Case Study in Transforming Web-Based Multimedia Research into Digital Archives AU - Fournet, Adele T2 - The American Archivist AB - This article is a case study in transforming web-based multimedia research initiatives into digital institutional archives to safeguard against the unstable nature of the Internet as a long-term historical medium. The study examines the Bit Rosie digital archives at the New York University Fales Library, which was created as a collaboration between a doctoral researcher in ethnomusicology and the head music librarian at the Avery Fisher Center for Music and Media. The article analyzes how the Bit Rosie archives implements elements of both feminist and activist archival practice in a born-digital context to integrate overlooked women music producers into the archives of the recorded music industry. The case study illustrates how collaboration between cultural creators, researchers, and archivists can give legitimacy and longevity to projects and voices of cultural resistance in the internet era. To conclude, the article suggests that more researchers and university libraries can use this case study as a model in setting up institutional archival homes for the increasing number of multimedia initiatives and projects blossoming throughout the humanities and social sciences. 
DA - 2021/06/24/ PY - 2021 DO - 10.17723/0360-9081-84.1.119 DP - Silverchair VL - 84 IS - 1 SP - 119 EP - 138 J2 - The American Archivist SN - 0360-9081 ST - Bit Rosie UR - https://doi.org/10.17723/0360-9081-84.1.119 Y2 - 2021/07/15/08:42:45 L2 - https://meridian.allenpress.com/american-archivist/article-abstract/84/1/119/466998/Bit-Rosie-A-Case-Study-in-Transforming-Web-Based ER - TY - JOUR TI - From archive to analysis: accessing web archives at scale through a cloud-based interface AU - Ruest, Nick AU - Fritz, Samantha AU - Deschamps, Ryan AU - Lin, Jimmy AU - Milligan, Ian T2 - International Journal of Digital Humanities AB - This paper introduces the Archives Unleashed Cloud, a web-based interface for working with web archives at scale. Current access paradigms, largely driven by the scope and scale of web archives, generally involve using the command line and writing code. This access gap means that subject-matter experts, as opposed to developers and programmers, have few options to directly work with web archives beyond the page-by-page paradigm of the Wayback Machine. Drawing on first-hand research and analysis of how scholars use web archives, we present the interface design and underpinning architecture of the Archives Unleashed Cloud. We also discuss the sustainability implications of providing a cloud-based service for researchers to analyze their collections at scale. DA - 2021/01/06/ PY - 2021 DO - 10.1007/s42803-020-00029-6 DP - Springer Link J2 - Int J Digit Humanities LA - en SN - 2524-7840 ST - From archive to analysis UR - https://doi.org/10.1007/s42803-020-00029-6 Y2 - 2021/07/15/08:43:31 L1 - https://link.springer.com/content/pdf/10.1007%2Fs42803-020-00029-6.pdf ER - TY - JOUR TI - Empirical Research on Web Harvesting in the Process of Text and Data Mining in National Libraries of EU Member States AU - Papadopoulos, Marinos AU - Botti, Maria AU - Ganatsiou, M. A. 
Paraskevi (Vicky) AU - Zampakolas, Christos T2 - Open Journal of Philosophy AB - Almost two decades of experience in web harvesting and archiving have now accumulated, and web harvesting and web archiving have been topics of keen interest to researchers, technologists, and librarians and information scientists. Web harvesting projects and pilot programs for archiving content found on the Web are becoming priorities for national libraries and cultural heritage organizations in the EU. This paper treats web harvesting as a process for data mining from the web and only through the web (a “pull” function); it elaborates upon research implemented within the framework of the funded research project “Web Archiving in Public Libraries and IP Law”, which focused on the processes of web harvesting and archiving as well as Text and Data Mining (TDM) operations in the national libraries of EU Member States. Web archiving, as an official operation in national libraries of EU Member States, creates web collections and preserves them so that they remain accessible and usable in perpetuity. The paper reports on research into various components of web harvesting and archiving conducted through an online survey (qualitative research) targeting the national libraries of EU Member States. The research team posed seventeen questions to EU national libraries; the survey output comes from answers delivered by 22 national libraries of EU Member States. The questionnaire was created in Google Forms, and the researchers contacted the EU national libraries via email and follow-up telephone calls to seek their participation in the research. The aim of the research was to examine participant libraries' Text and Data Mining operations, leveraging web harvesting and web archiving technologies and operations. 
Analysis of the results reveals that web harvesting is considered among national libraries' top priorities: the relevant projects are increasing in number, the web collections are growing, and the technological infrastructures and tools for web harvesting are improving. Yet many issues remain unresolved. A significant number of the surveyed libraries consider legal and technical issues the most important to resolve, and access to harvested material is still subject to legal restrictions. Directive 2019/790/EU on Copyright in the Digital Single Market (DSM) creates a favorable legal foundation for the deployment of web harvesting operations in the national libraries of EU Member States. TDM technologies open up new areas of research: web harvesting, initially aimed at preservation, now supports unprecedented research into national heritage through state-of-the-art automated TDM processes. DA - 2020/02/07/ PY - 2020 DO - 10.4236/ojpp.2020.101007 DP - www.scirp.org VL - 10 IS - 01 SP - 88 LA - en UR - http://www.scirp.org/journal/Paperabs.aspx?PaperID=98160 Y2 - 2021/07/15/08:44:27 L1 - http://www.scirp.org/journal/PaperDownload.aspx?paperID=98160 L2 - https://www.scirp.org/html/7-1651065_98160.htm ER - TY - CHAP TI - Giving with one click, taking with the other: e-legal deposit, web archives and researcher access AU - Winters, Jane T2 - Electronic Legal Deposit: Shaping the library collections of the future CY - London DA - 2020/// PY - 2020 DP - Zotero ET - 1. 
SP - 159 EP - 178 LA - en PB - Facet Publishing SN - 978-1-78330-377-9 UR - https://sas-space.sas.ac.uk/9439/1/Giving%20with%20one%20click%2C%20taking%20with%20the%20other.pdf Y2 - 2021/08/06/ L1 - https://sas-space.sas.ac.uk/9439/1/Giving%20with%20one%20click%2C%20taking%20with%20the%20other.pdf ER - TY - CONF TI - The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives AU - Ruest, Nick AU - Lin, Jimmy AU - Milligan, Ian AU - Fritz, Samantha T3 - JCDL '20 AB - The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building---all proceeding concurrently in mutually-reinforcing efforts. As we near the end of our initially-conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives. 
C1 - New York, NY, USA C3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 DA - 2020/00/01/ PY - 2020 DO - 10.1145/3383583.3398513 DP - ACM Digital Library SP - 157 EP - 166 PB - Association for Computing Machinery SN - 978-1-4503-7585-6 ST - The Archives Unleashed Project UR - https://doi.org/10.1145/3383583.3398513 Y2 - 2021/07/15/ L1 - https://dl.acm.org/doi/pdf/10.1145/3383583.3398513 L1 - https://yorkspace.library.yorku.ca/xmlui/bitstream/10315/37506/1/autk.pdf KW - apache spark KW - cloud platform KW - technology adoption ER - TY - BOOK TI - Technological arts preservation AU - Artut, Selçuk AU - Artut, Selcuk AU - Karaman, Osman Serhat AU - Yılmaz, Cemal AU - Yilmaz, Cemal AB - It is with an increasing tendency that works of art produced with the help of technology (such as video, sound, image, code, virtual reality, augmented reality, kinetic, digital and physical hybridities), or that require technology to work (such as hardware or software) are being included into various art collections. How these artworks would be carried into the future considering the rapidly advancing technology becomes a conundrum for all cultural institutions responsible for conserving cultural heritage. As a response to these needs, The Technological Arts Preservation Project has come into existence with the cooperation of Sakıp Sabancı Museum and Sabancı University. The project was initiated on May 23, 2019 when Osman Serhat Karaman, Sakıp Sabancı Museum digitalSSM Archive and Research Space Manager, invited Selçuk Artut, faculty member of Sabancı University Visual Arts and Visual Communication Design Program to give a speech on the issue. The Technological Arts Preservation Project aims at cooperation and information-sharing between professionals from various disciplines and areas of expertise. 
Scholars, media theorists, researchers, digital art conservators, curators, artists, software engineers, and computer scientists from significant institutions such as INA (Institut national de l'audiovisuel), Rhizome, Tate Modern, and ZKM have contributed to the research project, which has gained international status. Within the scope of the project, between November 15, 2019 and November 20, 2020, nine conferences and a workshop were held on the preservation of software-based art, the preservation of virtual reality, media archaeology, net art, and web archiving. Our aim in organizing these conferences and workshops was to contribute to international research on carrying both digital art and digital culture into the future, to discuss the results of new research, and to develop new and interdisciplinary modes of cooperation. Through these events, which have continued online since May 2020 due to the pandemic, we reached a total of 2,000 participants. Technological arts preservation is now a common concern: problems such as erased digital photos, irreparable backup units, and records consigned to oblivion by discontinued media players constitute a significant part of our daily lives. For artworks, however, it is of vital importance that the matter be handled from an interdisciplinary point of view within the context of preserving cultural values. This book was prepared with great care and in awareness of these responsibilities. Bringing together esteemed scholars, leading figures in arts and culture, artists, and scientists, all experts in their respective fields, it includes comprehensive texts approaching the issue from different points of view. 
The publication consists of three sections: the first includes in-depth essays; the second brings together content created from the events we conducted; and the third chronicles the artists' answers to a questionnaire on the preservation of their work. We hope that this book will be a well-rounded source for those with a sensibility for the cultural values that make us human and for how they may be carried into the future, and that it will light the way for similar studies. CY - Istanbul DA - 2021/06/18/ PY - 2021 DP - research.sabanciuniv.edu SP - 423 PB - Sabancı University Sakıp Sabancı Museum SN - 9786257329163 UR - https://www.sakipsabancimuzesi.org/en/page/technological-arts-preservation Y2 - 2021/07/15/08:47:43 L1 - http://research.sabanciuniv.edu/41560/2/SSM_TechnologicalArtsPreservation.pdf L2 - http://research.sabanciuniv.edu/41560/ ER - TY - ELEC TI - Legibility Machines: Archival Appraisal and the Genealogies of Use 
DA - 2021/07/15/08:48:15 PY - 2021 LA - hu ST - Legibility Machines UR - https://www.proquest.com/openview/18394f8f0fe123c09f08114c7b3d36f0/1?pq-origsite=gscholar&cbl=18750&diss=y Y2 - 2021/07/15/08:48:15 L2 - https://www.proquest.com/openview/18394f8f0fe123c09f08114c7b3d36f0/1?pq-origsite=gscholar&cbl=18750&diss=y ER - TY - CHAP TI - An Empirical Comparison of Web Page Segmentation Algorithms AU - Kiesel, Johannes AU - Meyer, Lars AU - Kneist, Florian AU - Stein, Benno AU - Potthast, Martin T2 - Advances in Information Retrieval A2 - Hiemstra, Djoerd A2 - Moens, Marie-Francine A2 - Mothe, Josiane A2 - Perego, Raffaele A2 - Potthast, Martin A2 - Sebastiani, Fabrizio AB - Over the past two decades, several algorithms have been developed to segment a web page into semantically coherent units, a task with several applications in web content analysis. However, these algorithms have hardly been compared empirically and it thus remains unclear which of them—or rather, which of their underlying paradigms—performs best. To contribute to closing this gap, we report on the reproduction and comparative evaluation of five segmentation algorithms on a large, standardized benchmark dataset for web page segmentation: Three of the algorithms have been specifically developed for web pages and have been selected to represent paradigmatically different approaches to the task, whereas the other two approaches originate from the segmentation of photos and print documents, respectively. For a fair comparison, we tuned each algorithm’s parameters, if applicable, to the dataset. Altogether, the classic rule-based VIPS algorithm achieved the highest performance, closely followed by the purely visual approach of Cormier et al. For reproducibility, we provide our reimplementations of the algorithms along with detailed instructions. 
CY - Cham DA - 2021/// PY - 2021 DP - DOI.org (Crossref) VL - 12657 SP - 62 EP - 74 LA - en PB - Springer International Publishing SN - 978-3-030-72239-5 978-3-030-72240-1 UR - http://link.springer.com/10.1007/978-3-030-72240-1_5 Y2 - 2021/07/15/08:49:13 L1 - https://webis.de/downloads/publications/papers/kiesel_2021a.pdf L1 - https://webis.de/downloads/publications/papers/kiesel_2021a.pdf ER - TY - CHAP TI - The Rhetorical Lives and Afterlives of Political Pledges in British Political Speech c. 2000–2013 AU - Freeman, James T2 - Electoral Pledges in Britain Since 1918: The Politics of Promises A2 - Thackeray, David A2 - Toye, Richard AB - This chapter brings together quantitative and qualitative methods to examine the rhetoric of commitment-making at scale. It recovers more than 5000 speeches published online by Britain’s three main political parties between 2000 and 2013, and uses this new corpus to explore the rhetorical lives and afterlives of pledges made under the New Labour and Cameron Coalition governments. Text analysis and experimental artificial intelligence techniques highlight the cyclical patterns of promise-making, whilst manual tagging and close reading situate the archetypal policy pledge within a wider landscape of commitment rhetoric. Ultimately, the chapter argues that promises are significant, not just as indicators of party ideology or the health of democratic debate, but as rhetorical acts that organise, contribute to, and bind together the traditional appeals and parts of speech that make up political rhetoric. 
CY - Cham DA - 2020/// PY - 2020 DP - Springer Link SP - 291 EP - 314 LA - en PB - Springer International Publishing SN - 978-3-030-46663-3 UR - https://doi.org/10.1007/978-3-030-46663-3_14 Y2 - 2021/07/15/08:50:03 KW - Deep learning KW - Political history KW - Promises KW - Rhetoric KW - Text analysis ER - TY - JOUR TI - Assessing the loss of Western Canadian digital heritage AU - Saiyera, Tasbire AU - Ayala, Brenda Reyes AU - Du, Qiufeng T2 - Proceedings of the Annual Conference of CAIS / Actes du congrès annuel de l'ACSI DA - 2021/05/31/ PY - 2021 DO - 10.29173/cais1218 DP - journals.library.ualberta.ca LA - en SN - 2562-7589 UR - https://journals.library.ualberta.ca/ojs.cais-acsi.ca/index.php/cais-asci/article/view/1218 Y2 - 2021/07/15/08:51:26 L1 - https://journals.library.ualberta.ca/ojs.cais-acsi.ca/index.php/cais-asci/article/download/1218/1054 KW - web archives ER - TY - RPRT TI - Technical Report - Portuguese Web Archive Image Search AU - Mourão, André AU - Melo, Fernando AB - The popular saying “a picture paints a thousand words” illustrates the information richness an image can provide. There are billions of images available on the Web, and such graphic and other complex resources require special search capabilities. In the late 90s, AltaVista released the first major text-to-image search engine on the Web, followed by Google in 2001. Searching for images is a prominent need for users of the live Web. Finding images using text in Web Archives is thus an important task, as it adds a temporal perspective on how images on the web change. User studies show that a significant number of users use Web Archives to find past images for specific subjects or events. In 2016, the Internet Archive released an image search portal for animated GIF images harvested from the Geocities website. 
In 2017, the Royal Danish Library developed new Wayback software with image search capabilities, and in 2018 Arquivo.pt launched an image search beta service allowing temporal image searches (e.g. searching for images related to the Olympic Games between 1996 and 2010). CY - Lisboa DA - 2020/// PY - 2020 DP - Zotero SP - 19 LA - en UR - https://sobre.arquivo.pt/wp-content/uploads/Searching_Images_Within_a_WebArchive.pdf L1 - https://sobre.arquivo.pt/wp-content/uploads/Searching_Images_Within_a_WebArchive.pdf ER - TY - JOUR TI - A Deep Learning Approach to Identify Not Suitable for Work Images AU - Bicho, Daniel AB - Web Archiving (WA) deals with the preservation of portions of the World Wide Web (WWW), keeping them available for the future. Arquivo.pt is a WA initiative holding a huge amount of content, including image files. However, some of these images contain nudity and pornography, which can be offensive to users and are thus Not Suitable For Work (NSFW). This work proposes a solution to classify NSFW images found in Arquivo.pt using deep neural network approaches. A large dataset of images is built using Arquivo.pt data, and two pre-trained neural network models, namely ResNet and SqueezeNet, are evaluated and improved for the NSFW classification task using the dataset. The evaluation of these models reported an accuracy of 93% and 72%, respectively. After a fine-tuning stage, the accuracy of these models improved to 94% and 89%, respectively. The proposed solution is integrated into the Arquivo.pt Image Search System, available at https://arquivo.pt/images.jsp. 
DA - 2020/// PY - 2020 DP - Zotero VL - 6 IS - 1 SP - 11 LA - en L1 - https://repositorio.ipl.pt/bitstream/10400.21/12354/1/A%20deep_AFerreira.pdf ER - TY - JOUR TI - Survivors: Archiving the history of bulletin board systems and the AIDS crisis AU - Brewster, Kathryn AU - Ruberg, Bonnie T2 - First Monday AB - The history of the Internet and the history of the HIV/AIDS crisis are fundamentally intertwined. Because of the precarious nature of primary early Internet materials, however, documentation that reflects this relationship is limited. Here, we present and analyse an important document that offers considerable insight in this area: a full printout of the bulletin board system (BBS) discussion group “SURVIVORS.” Run by David Charnow, SURVIVORS operated as an “electronic support group” for members living with HIV/AIDS from 1987 to 1990. These dates represent a period of overlap between both the AIDS crisis in America and the use of BBSs as a predecessor to contemporary Internet technologies. The contents of SURVIVORS were printed by Charnow before his death in 1990 and later donated to the ONE National Gay and Lesbian Archives. Through our discussion of these documents, we articulate the striking relationship between the SURVIVORS printout as a material document that preserves a digital past and the lives of those whose stories are contained within the printout. We argue that it is not only the content but indeed the precarious, shifting media format of the SURVIVORS printout, born digital and now preserved on paper, that gives it its meaning. Thirty years after his death, Charnow’s printout of SURVIVORS keeps a critical piece of the interrelated histories of HIV/AIDS and the Internet alive, while also raising valuable questions about the archiving of these histories. 
DA - 2020/09/22/ PY - 2020 DO - 10.5210/fm.v25i10.10290 DP - firstmonday.org LA - en SN - 1396-0466 ST - SURVIVORS UR - https://firstmonday.org/ojs/index.php/fm/article/view/10290 Y2 - 2021/07/15/08:54:34 KW - Archives KW - BBS KW - HIV/AIDS ER - TY - JOUR TI - Using mixed methods to study the historical use of web beacons in web tracking AU - Nielsen, Janne T2 - International Journal of Digital Humanities AB - Historical studies of the use of tracking technologies collecting data about web users and their behaviour can help us understand the spread and implications of web tracking. This article presents a historical study of the use of a specific tracking technology, the web beacon, on the Danish web from 2006 to 2015 using archived web materials from the national Danish web archive. The study combines a large-scale quantitative mapping of the use of web beacons on the Danish web with a qualitative study of specific websites. Using this mixed-method design, the article identifies the prevalent third-party domains setting web beacons and the different purposes for beacon use. The findings show the ratio of Danish to international third-party domains involved in the tracking and the development, over time, of what types of beacon providers are dominant on the Danish web. The article also addresses the methodological challenges related to using archived web for a mixed-method historical study of web tracking. 
DA - 2021/04/27/ PY - 2021 DO - 10.1007/s42803-021-00033-4 DP - Springer Link J2 - Int J Digit Humanities LA - en SN - 2524-7840 UR - https://doi.org/10.1007/s42803-021-00033-4 Y2 - 2021/07/15/08:54:49 ER - TY - JOUR TI - Digital Preservation: challenges, requirements, strategies and scientific output AU - Formenton, Danilo AU - Gracioso, Luciana AB - The aim of this article is to provide a broad and reflective perspective on the main aspects of digital preservation, based on the challenges indicated, the requirements recognized, and the strategies analyzed by the scientific community. The methodology adopts quantitative-qualitative and exploratory-descriptive research, with a review of the national and international literature on digital preservation from the last twenty-one years, in order to delineate trends and policies on the theme and to deepen the discussion of the needs for archiving and long-term preservation of digital content. Data are analyzed from a bibliographic survey of scientific publications on "digital preservation" indexed in Scopus and Web of Science over the last five years (2015-2019). It was found that, among the themes discussed, budgets, costs, and metadata for preservation and web archiving are emerging topics that remain understudied in Brazilian Information Science. In international scientific output, Brazil stands out for publication quantity, indicating a maturation of the theme that coincides with the advance of national projects such as the Cariniana Network. However, there are financial, human, and technological demands that, together with the characteristics of digital preservation strategies, highlight the usefulness of collaboration and of little-explored national topics. 
DA - 2020/06/14/ PY - 2020 DP - ResearchGate VL - 18 SP - 1 EP - 26 ST - Digital Preservation L4 - https://www.researchgate.net/profile/Danilo-Formenton/publication/342163605_Digital_Preservation_challenges_requirements_strategies_and_scientific_output/links/5ee6477b92851ce9e7e39d74/Digital-Preservation-challenges-requirements-strategies-and-scientific-output.pdf ER - TY - JOUR TI - Text and Data Mining for the National Library of Greece in consideration of Internet Security and GDPR AU - Vavousis, Konstantinos AU - Papadopoulos, Marinos AU - Gerolimos, Michalis AU - Xenakis, Christos T2 - Qualitative and Quantitative Methods in Libraries AB - Text and Data Mining (TDM) as a technological option is usually leveraged upon by large libraries worldwide in the technologically enhanced processes of web-harvesting and web-archiving with the aim to collect, download, archive, and preserve content and works that are found available on the Internet. TDM is used to index, analyze, evaluate and interpret mass quantities of works including texts, sounds, images or data through an automated "tracking and pulling" process of online material. Access to the web content and works available online are subject to restrictions by legislation, especially to laws pertaining to Copyright, Industrial Property Rights and Data Privacy. As far as Data Privacy is concerned, the application of the General Data Protection Regulation (GDPR) is considered as an issue of vital importance for the smooth operation of TDM service offered by national libraries mostly in the EU Member States, which among other requirements mandates the adoption of privacy-by-design and advanced security techniques. This article focuses on the TDM deployed by National Library of Greece (NLG) and considerations for applied Internet Security solutions taking into account GDPR requirements. 
NLG has deployed TDM since February 2017 in consideration of the provision of art. 4(4)(b) of Law 4452/2017, as well as the provisions of Regulation 2016/679/EU (GDPR). Art. 4(4)(b) of Law 4452/2017 places TDM activity in Greece under the responsibility of NLG, appointed as the organization to undertake, allocate, and coordinate the archiving of the Hellenic web, i.e. as the organization responsible for text and data analysis at the national level in Greece. The deployment of TDM by NLG, presented by the authors, provides a framework of technical and legal considerations so that the electronic service built on the TDM operation complies with the data protection requirements set by the new EU legislation. While the presentation elaborates upon the minimum set of technical Internet security measures considered by NLG for achieving GDPR compliance, the paper (to be published) focuses on TDM and GDPR issues specifically in relation to art. 89 of the GDPR, titled “Safeguards and derogations relating to processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes”, which is a key provision for the operation of NLG in compliance with the GDPR. DA - 2020/10/05/ PY - 2020 DP - 78.46.229.148 VL - 9 IS - 3 SP - 441 EP - 460 LA - en SN - 2241-1925 UR - http://78.46.229.148/ojs/index.php/qqml/article/view/637 Y2 - 2021/07/15/08:59:22 L1 - http://78.46.229.148/ojs/index.php/qqml/article/download/637/591 ER - TY - JOUR TI - The National Library of Medicine Global Health Events web archive, coronavirus disease (COVID-19) pandemic collecting AU - Speaker, Susan L. 
AU - Moffatt, Christie T2 - Journal of the Medical Library Association : JMLA AB - Since January 30, 2020, when the World Health Organization declared the SARS CoV-2 disease (COVID-19) to be a public health emergency of international concern, the National Library of Medicine's (NLM's) Web Collecting and Archiving Working Group has been collecting a broad range of web-based content about the emerging pandemic for preservation in an Internet archive. Like NLM's other Global Health Events web collections, this content will have enduring value as a multifaceted historical record for future study and understanding of this event. This article describes the scope of the COVID-19 project; some of the content captured from websites, blogs, and social media; collecting criteria and methods; and related COVID-19 collecting efforts by other groups. The growing collection—2,500 items as of June 30, 2020—chronicles the many facets of the pandemic: epidemiology; vaccine and drug research; disease control measures and resistance to them; effects of the pandemic on health care institutions and workers, education, commerce, and many aspects of social life; effects for especially vulnerable groups; role of health disparities in infection and mortality; and recognition of racism as a public health emergency. DA - 2020/// PY - 2020 DO - 10.5195/jmla.2020.1090 DP - PubMed Central VL - 108 IS - 4 SP - 656 EP - 662 J2 - J Med Libr Assoc SN - 1536-5050 UR - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7524615/ Y2 - 2021/07/15/09:00:00 L1 - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7524615/pdf/jmla-108-4-656.pdf L2 - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7524615/ ER - TY - CONF TI - Making Recommendations from Web Archives for AU - Alkwai, Lulwah M. AU - Nelson, Michael L. AU - Weigle, Michele C. 
T3 - JCDL '20 AB - When a user requests a web page from a web archive, the user will typically either get an HTTP 200 if the page is available, or an HTTP 404 if the web page has not been archived. This is because web archives are typically accessed by Uniform Resource Identifier (URI) lookup, and the response is binary: the archive either has the page or it does not, and the user will not know of other archived web pages that exist and are potentially similar to the requested web page. In this paper, we propose augmenting these binary responses with a model for selecting and ranking recommended web pages in a Web archive. This is to enhance both HTTP 404 responses and HTTP 200 responses by surfacing web pages in the archive that the user may not know existed. First, we check if the URI is already classified in DMOZ or Wikipedia. If the requested URI is not found, we use machine learning to classify the URI using DMOZ as our ontology and collect candidate URIs to recommend to the user. The classification is in two parts: a first-level classification and a deep classification. Next, we filter the candidates based on whether they are present in the archive. Finally, we rank candidates based on several features, such as archival quality, web page popularity, temporal similarity, and URI similarity. We calculated the F1 score for different methods of classifying the requested web page at the first level. We found that using all-grams from the URI after removing numerals and the top-level domain (TLD) produced the best result with F1 = 0.59. For the deep-level classification, we measured the accuracy at each classification level. For second-level classification, the micro-average F1 = 0.30, and for third-level classification, F1 = 0.15. We also found that 44.89% of the correctly classified URIs contained at least one word that exists in a dictionary and 50.07% of the correctly classified URIs contained long strings in the domain.
In comparison with the URIs from our Wayback access logs, only 5.39% of those URIs contained only words from a dictionary, and 26.74% contained at least one word from a dictionary. These percentages are low and may affect the ability to correctly classify the requested URI. C1 - New York, NY, USA C3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 DA - 2020/// PY - 2020 DO - 10.1145/3383583.3398533 DP - ACM Digital Library SP - 87 EP - 96 PB - Association for Computing Machinery SN - 978-1-4503-7585-6 UR - https://doi.org/10.1145/3383583.3398533 Y2 - 2021/07/15/ L1 - https://dl.acm.org/doi/pdf/10.1145/3383583.3398533 KW - web archiving KW - URI KW - classifying KW - recommending ER - TY - JOUR TI - Big data experiments with the archived Web: Methodological reflections on studying the development of a nation's Web AU - Brügger, Niels AU - Nielsen, Janne AU - Laursen, Ditte T2 - First Monday AB - This article outlines how the 'digital geography' of a nation can be studied, that is the online presence of one nation. The entire Danish Web domain and its development from 2006 to 2015 is used as a case, based on the holdings in the Danish national Web archive. The following research questions guide the investigation: What has the Danish Web domain looked like in the past, and how has it developed in the period 2006-2015? Methodologically, we investigate to what extent one can delimit 'a nation' on the Web, and what characterizes the archived Web as a historical source for academic studies, as well as the general characteristics of our specific data source. Analytically, the article introduces a design for how this type of big data analysis of an entire national Web domain can be performed. Our findings show some of the ways in which a nation's digital landscape can be mapped, i.e., on size, content types, and hyperlinks.
On a broader canvas, this study demonstrates that with hard- and software as well as human competencies from different disciplines it is possible to perform large-scale historical studies of one of the biggest media sources of today, the World Wide Web. DA - 2020/02/10/ PY - 2020 DO - 10.5210/fm.v25i3.10384 DP - journals.uic.edu LA - en SN - 1396-0466 ST - Big data experiments with the archived Web UR - https://journals.uic.edu/ojs/index.php/fm/article/view/10384 Y2 - 2021/07/15/09:01:35 KW - big data KW - historiography KW - Web history KW - geography KW - the World Wide Web ER - TY - JOUR TI - Cuéntalo: the path between archival activism and the social archive(s) AU - Ruiz Gómez, Vicenç AU - Maria Vallès, Aniol T2 - Archives and Manuscripts DA - 2020/09/01/ PY - 2020 DO - 10.1080/01576895.2020.1802306 DP - DOI.org (Crossref) VL - 48 IS - 3 SP - 271 EP - 290 J2 - Archives and Manuscripts LA - en SN - 0157-6895, 2164-6058 ST - #Cuéntalo UR - https://www.tandfonline.com/doi/full/10.1080/01576895.2020.1802306 Y2 - 2021/07/15/09:02:37 ER - TY - JOUR TI - Robustifying Links To Combat Reference Rot AU - Jones, Shawn M. AU - Klein, Martin AU - Van de Sompel, Herbert T2 - The Code4Lib Journal AB - Links to web resources frequently break, and linked content can change at unpredictable rates. These dynamics of the Web are detrimental when references to web resources provide evidence or supporting information.
In this paper, we highlight the significance of reference rot, provide an overview of existing techniques to address it and their characteristics, and introduce our Robust Links approach, including its web service and underlying API. Robustifying links offers a proactive, uniform, and machine-actionable way to combat reference rot. In addition, we discuss our reasoning and the measures aimed at keeping the approach functional for the long term. To showcase our approach, we have robustified all links in this article. DA - 2021/02/10/ PY - 2021 DP - Code4Lib Journal IS - 50 SN - 1940-5758 UR - https://journal.code4lib.org/articles/15509?utm_campaign=La%20veille%20de%20l%27Infoth%C3%A8que%20HEG%20Gen%C3%A8ve%20&utm_medium=email&utm_source=Revue%20newsletter Y2 - 2021/07/15/09:04:25 L2 - https://journal.code4lib.org/articles/15509?utm_campaign=La%20veille%20de%20l%27Infoth%C3%A8que%20HEG%20Gen%C3%A8ve%20&utm_medium=email&utm_source=Revue%20newsletter ER - TY - JOUR TI - Social media data archives in an API-driven world AU - Acker, Amelia AU - Kreisberg, Adam T2 - Archival Science AB - In this article, we explore the long-term preservation implications of application programming interfaces (APIs) which govern access to data extracted from social media platforms. We begin by introducing the preservation problems that arise when APIs are the primary way to extract data from platforms, and how these tensions fit with existing models of archives and digital repository development. We then define a range of possible types of API users motivated to access social media data from platforms and consider how these users relate to principles of digital preservation. We discuss how platforms’ policies and terms of service govern the set of possibilities for access using these APIs and how the current access regime permits persistent problems for archivists who seek to provide access to collections of social media data.
We conclude by surveying emerging models for access to social media data archives found in the USA, including community-driven not-for-profit archives, university research repositories, and early industry–academic partnerships with platforms. Given the important role these platforms occupy in capturing and reflecting our digital culture, we argue that archivists and memory workers should apply a platform perspective when confronting the rich problem space that social platforms and their APIs present for the possibilities of social media data archives, asserting their role as “developer stewards” in preserving culturally significant data from social media platforms. DA - 2020/06/01/ PY - 2020 DO - 10.1007/s10502-019-09325-9 DP - Springer Link VL - 20 IS - 2 SP - 105 EP - 123 J2 - Arch Sci LA - en SN - 1573-7500 UR - https://doi.org/10.1007/s10502-019-09325-9 Y2 - 2021/07/15/09:04:57 L1 - https://link.springer.com/content/pdf/10.1007%2Fs10502-019-09325-9.pdf KW - Social media KW - APIs KW - data archives KW - Developer stewards KW - Platform perspective ER - TY - JOUR TI - The Wayback Machine: notes on a re-enchantment AU - Bowyer, Surya T2 - Archival Science AB - The Internet Archive’s Wayback Machine holds over 424 billion webpages, making it the largest publicly accessible archive in the world. Thus far, much of the research on the Machine has approached the technology using computational thinking. This type of thinking treats technology operationally, as something that we can use to do jobs for us. This article takes a different approach. It steps back from computational thinking to consider the language we use to apprehend technology. It argues that the metaphors we use actually obfuscate, rather than merely describe, the operations of the Machine. By making explicit the workings of these metaphors, the article draws attention to, and thus counteracts, this obfuscation.
In so doing, these notes on the Wayback Machine point more widely towards the usefulness of a language-oriented approach to other technologies. DA - 2021/03/01/ PY - 2021 DO - 10.1007/s10502-020-09345-w DP - Springer Link VL - 21 IS - 1 SP - 43 EP - 57 J2 - Arch Sci LA - en SN - 1573-7500 ST - The Wayback Machine UR - https://doi.org/10.1007/s10502-020-09345-w Y2 - 2021/07/15/09:05:11 ER - TY - CHAP TI - 28 - After COVID? Classical mechanics AU - Hawley, Graeme T2 - Libraries, Digital Information, and COVID A2 - Baker, David A2 - Ellis, Lucy T3 - Chandos Digital Information Review AB - COVID-19 has understandably been foremost in our minds over the last year and will continue to be for some time, but it is not the only urgent crisis that individuals, societies, and nations face. This essay looks at current events through the lens of Alvin Toffler’s publication The Third Wave, focusing especially on the accelerative nature of change today and how it increases complexity. Graeme Hawley, Head of General Collections at the National Library of Scotland, considers what accelerative change means in terms of the collections he is responsible for, and the extent to which COVID-19 is likely to impact accelerative change in the immediate future. The essay takes a broad look at topics that, although distinct in themselves, all share the qualities of velocity, and all seem to be happening at roughly the same time so that we can situate the post-COVID world in its fuller context. DA - 2021/// PY - 2021 DP - ScienceDirect SP - 291 EP - 302 LA - en PB - Chandos Publishing SN - 978-0-323-88493-8 ST - 28 - After COVID?
UR - https://www.sciencedirect.com/science/article/pii/B9780323884938000367 Y2 - 2021/07/15/09:05:38 L2 - https://www.sciencedirect.com/science/article/pii/B9780323884938000367 KW - Web archiving KW - National libraries KW - Social media KW - Accelerative change KW - Complexity KW - Digital publishing KW - National Library of Scotland KW - Velocity ER - TY - THES TI - Linked Research on the Decentralised Web AU - Capadisli, Sarven CY - Bonn DA - 2020/// PY - 2020 DP - Zotero LA - en PB - Universität Bonn UR - https://bonndoc.ulb.uni-bonn.de/xmlui/handle/20.500.11811/8352 L1 - https://bonndoc.ulb.uni-bonn.de/xmlui/bitstream/handle/20.500.11811/8352/5815.pdf?sequence=1&isAllowed=y ER - TY - JOUR TI - Visualizing Webpage Changes Over Time AU - Mabe, Abigail AU - Patel, Dhruv AU - Gunnam, Maheedhar AU - Shankar, Surbhi AU - Kelly, Mat AU - Alam, Sawood AU - Nelson, Michael L. AU - Weigle, Michele C. T2 - arXiv:2006.02487 [cs] AB - We report on the development of TMVis, a web service to provide visualizations of how individual webpages have changed over time. We leverage past research on summarizing collections of webpages with thumbnail-sized screenshots and on choosing a small number of representative past archived webpages from a large collection. We offer four visualizations: image grid, image slider, timeline, and animated GIF. Embed codes for the image grid and image slider can be produced to include these on separate webpages. The animated GIF can be downloaded as an image file for the same purpose. This tool can be used to allow scholars from various disciplines, as well as the general public, to explore the temporal nature of web archives. We hope that these visualizations will just be the beginning and will provide a starting point for others to expand these types of offerings for users of web archives. 
DA - 2020/06/03/ PY - 2020 DP - arXiv.org UR - http://arxiv.org/abs/2006.02487 Y2 - 2021/07/15/09:08:00 L1 - https://arxiv.org/pdf/2006.02487.pdf L2 - https://arxiv.org/abs/2006.02487 KW - Computer Science - Digital Libraries ER - TY - JOUR TI - Preserving Data Journalism: A Systematic Literature Review AU - Heravi, Bahareh AU - Cassidy, Kathryn AU - Davis, Edie AU - Harrower, Natalie T2 - Journalism Practice AB - News organisations have longstanding practices for archiving and preserving their content. The emerging practice of data journalism has led to the creation of complex new outputs, including dynamic data visualisations that rely on distributed digital infrastructures. Traditional news archiving does not yet have systems in place for preserving these outputs, which means that we risk losing this crucial part of reporting and news history. Following a systematic approach to studying the literature in this area, this paper provides a set of recommendations to address lacunae in the literature. This paper contributes to the field by (1) providing a systematic study of the literature in the field, (2) providing a set of recommendations for the adoption of long-term preservation of dynamic data visualisations as part of the news publication workflow, and (3) identifying concrete actions that data journalists can take immediately to ensure that these visualisations are not lost.
DA - 2021/03/31/ PY - 2021 DO - 10.1080/17512786.2021.1903972 DP - Taylor and Francis+NEJM VL - 0 IS - 0 SP - 1 EP - 23 SN - 1751-2786 ST - Preserving Data Journalism UR - https://doi.org/10.1080/17512786.2021.1903972 Y2 - 2021/07/15/09:08:54 L1 - https://www.tandfonline.com/doi/pdf/10.1080/17512786.2021.1903972 L2 - https://www.tandfonline.com/doi/full/10.1080/17512786.2021.1903972 KW - digital preservation KW - software preservation KW - data journalism KW - data visualisation KW - data visualization KW - data-driven journalism KW - digital archiving ER - TY - JOUR TI - Proactive ephemerality: How journalists use automated and manual tweet deletion to minimize risk and its consequences for social media as a public archive AU - Ringel, Sharon AU - Davidson, Roei T2 - New Media & Society AB - Despite their ephemeral constantly changing nature, social media constitute an archive of public discourse. In this study, we examine when, how, and why journalists practice proactive ephemerality, deleting their tweets either manually or automatically to consider the viability of social media as a public record. Based on interviews conducted with journalists in New York City, we find many journalists delete their tweets, and that software-aided mass deletion is common, damaging Twitter’s standing as an archive. Through deletion, journalists manipulate temporality, exposing the public to a brief tweeting window to reduce risks and regain control in a precarious labor market and a harassment-ridden public sphere in which employers leave them largely unprotected. When deleting tweets mechanically, journalists emulate platform logic by depending—as commercial platforms often do—on automatic procedures rather than on human expertise. This constitutes a surrender of the very qualities that make human judgment so valuable. 
DA - 2020/// PY - 2020 DO - 10.1177/1461444820972389 DP - SAGE Journals SP - 1461444820972389 J2 - New Media & Society LA - en SN - 1461-4448 ST - Proactive ephemerality UR - https://doi.org/10.1177/1461444820972389 Y2 - 2021/07/15/09:09:40 L1 - https://journals.sagepub.com/doi/pdf/10.1177/1461444820972389 KW - Twitter KW - digital archives KW - Deletion KW - ephemerality KW - journalists KW - platformization ER - TY - JOUR TI - Webarchívum mint a tudományos kutatások tárgya [The web archive as a subject of scholarly research] AU - Németh, Márton T2 - Tudományos és Műszaki Tájékoztatás DA - 2020/12/08/ PY - 2020 DP - tmt.omikk.bme.hu VL - 67 IS - 12 SP - 757 EP - 765 LA - hu UR - https://tmt.omikk.bme.hu/tmt/article/view/12804 Y2 - 2021/07/15/09:10:04 L1 - https://tmt.omikk.bme.hu/tmt/article/download/12804/14543 KW - kutatás [research] ER - TY - JOUR TI - Digital humanities and web archives: Possible new paths for combining datasets AU - Brügger, Niels T2 - International Journal of Digital Humanities AB - This article discusses the importance of web archives making their collections available as data and not only as sources seen through the Wayback Machine’s interface where only individual web pages are displayed. This will help unlock the full potential of the treasure trove that web archives constitute, and thereby also open up for methods from the wider field of digital humanities. Based on a case study of the entire Danish web domain .dk the article discusses methodological challenges involved in combining large extracted datasets from web archives, namely metadata about the size of websites and data about hyperlinks from the same websites. The aim is to answer the following two questions: 1) How to combine two different types of datasets extracted from a web archive, in this case the Danish Netarkivet? 2) What can the result of such a combination teach us about the structural characteristics of the Danish web domain from 2006 to 2015?
The article shows that, indeed, it is possible to go beyond the Wayback Machine as the prime interface to web archives by combining two distinct datasets, and that such a venture can provide valuable knowledge about the overall structure of the Danish web domain, thus highlighting that websites of the same size tend to constitute isolated ‘link islands’, and that big websites are also the most important in the hyperlink network, which is more clearly the case in 2015 than in 2006. DA - 2021/05/28/ PY - 2021 DO - 10.1007/s42803-021-00038-z DP - Springer Link J2 - Int J Digit Humanities LA - en SN - 2524-7840 ST - Digital humanities and web archives UR - https://doi.org/10.1007/s42803-021-00038-z Y2 - 2021/07/15/09:10:23 ER - TY - CONF TI - Building Web Corpora for Minority Languages AU - Jauhiainen, Heidi AU - Jauhiainen, Tommi AU - Lindén, Krister AB - Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the “Finno-Ugric Languages and the Internet” (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project. 
C1 - Marseille, France C3 - Proceedings of the 12th Web as Corpus Workshop DA - 2020/05// PY - 2020 DP - ACLWeb SP - 23 EP - 32 LA - English PB - European Language Resources Association SN - 979-10-95546-68-9 UR - https://aclanthology.org/2020.wac-1.4 Y2 - 2021/07/15/09:10:38 L1 - https://aclanthology.org/2020.wac-1.4.pdf ER - TY - JOUR TI - Remembering is a form of honouring: preserving the COVID-19 archival record AU - Jones, Esyllt W. AU - Sweeney, Shelley AU - Milligan, Ian AU - Bak, Greg AU - McCutcheon, Jo-Anne T2 - FACETS DA - 2021/// PY - 2021 DO - 10.1139/facets-2020-0115 DP - facetsjournal.com (Atypon) VL - 6 SP - 545 EP - 568 ST - Remembering is a form of honouring UR - https://www.facetsjournal.com/doi/full/10.1139/facets-2020-0115 Y2 - 2021/07/15/09:11:02 L1 - https://www.facetsjournal.com/doi/pdf/10.1139/facets-2020-0115 ER - TY - JOUR TI - Space, time, and culture on African/diaspora websites: a tangled web we weave AU - Brinkman, Inge AU - Merolla, Daniela T2 - Journal of African Cultural Studies AB - The four articles presented in this collection on the topic ‘Space, time, and culture on African/diaspora websites’ address African/diaspora websites and their networks in divergent ways and focus on a number of broad themes.
These are: that the future is located in and shaped by the past; movement and displacement in geographical terms are re-conceptualised through website activities to produce discourses of identity; online and offline interactions are linked and there is fluid passage between the two; while web design programmes define the visual and content-based structure in vertical and horizontal sections, the websites analysed here are individualised by the specific collection of texts and images and the more or less extended presence of sounds and videos; the Internet has enhanced the possibilities for visibility and has opened up a range of new dynamics for creating and reaching out to publics, also for individuals or groups that are not likely to acquire a voice on traditional media platforms. DA - 2020/// PY - 2020 DO - 10.1080/13696815.2019.1657003 DP - Taylor and Francis+NEJM VL - 32 IS - 1 SP - 1 EP - 6 SN - 1369-6815 ST - Space, time, and culture on African/diaspora websites UR - https://doi.org/10.1080/13696815.2019.1657003 Y2 - 2021/07/15/09:11:22 L1 - https://www.tandfonline.com/doi/pdf/10.1080/13696815.2019.1657003 L2 - https://www.tandfonline.com/doi/full/10.1080/13696815.2019.1657003 KW - websites KW - time KW - culture KW - space ER - TY - CONF TI - WARChain: Blockchain-Based Validation of Web Archives AU - Lendák, Imre AU - Indig, Balázs AU - Palkó, Gábor A2 - Groß, Thomas A2 - Viganò, Luca T3 - Lecture Notes in Computer Science AB - Background. Web archives store born-digital documents, which are usually collected from the Internet by crawlers and stored in the Web Archive (WARC) format. The trustworthiness and integrity of web archives is still an open challenge, especially in the news portal domain, which faces additional challenges of censorship even in democratic societies. Aim.
The aim of this paper is to present a light-weight, blockchain-based solution for web archive validation, which would ensure that the crawled documents are authentic for many years to come. Method. We developed our archive validation solution as an extension and continuation of our work in web crawler development, mainly targeting news portals. The system is designed as an overlay over a blockchain with a proof-of-stake (PoS) distributed consensus algorithm. PoS was chosen due to its lower ecological footprint compared to proof-of-work solutions (e.g. Bitcoin) and lower expected investment in computing infrastructure. Results. We implemented a prototype of the proposed solution in Python and C#. The prototype was tested on web archive content crawled from Hungarian news portals at two different timestamps, consisting of 1 million articles in total. Conclusions. We concluded that the proposed solution is accessible, usable by different stakeholders to validate crawled content, deployable on cheap commodity hardware, tackles the archive integrity challenge, and is capable of efficiently managing duplicate documents. C1 - Cham C3 - Socio-Technical Aspects in Security and Trust DA - 2021/// PY - 2021 DO - 10.1007/978-3-030-79318-0_7 DP - Springer Link SP - 121 EP - 134 LA - en PB - Springer International Publishing SN - 978-3-030-79318-0 ST - WARChain KW - Web archive KW - Web crawling KW - Blockchain KW - Censorship KW - Proof-of-stake KW - Validation ER - TY - JOUR TI - Archive This Moment D.C.: A Case Study of Participatory Collecting During COVID-19 AU - Burns, Julie AU - Farley, Laura AU - Hagan, Siobhan C. AU - Kelly, Paul AU - Warwick, Lisa T2 - The Code4Lib Journal AB - When the COVID-19 pandemic brought life in Washington, D.C. to a standstill in March 2020, staff at DC Public Library began looking for ways to document how this historic event was affecting everyday life.
Recognizing the value of first-person accounts for historical research, staff launched Archive This Moment D.C. to preserve the story of daily life in the District during the stay-at-home order. Materials were collected from public Instagram and Twitter posts submitted through the hashtag #archivethismomentdc. In addition to social media, creators also submitted materials using an Airtable webform set up for the project and through email. Over 2,000 digital files were collected. This article will discuss the planning, professional collaboration, promotion, selection, access, and lessons learned from the project, as well as the technical setup, collection strategies, and metadata requirements. In particular, this article will include a discussion of the evolving collection scope of the project and the need for clear ethical guidelines surrounding privacy when collecting materials in real time. DA - 2021/02/10/ PY - 2021 DP - Code4Lib Journal IS - 50 SN - 1940-5758 ST - Archive This Moment D.C.
UR - https://journal.code4lib.org/articles/15534 Y2 - 2021/07/15/09:12:58 L2 - https://journal.code4lib.org/articles/15534 ER - TY - RPRT TI - Collection plan for online materials 2021-2024 AU - Heikkinen, Jari AU - Kaunonen, Kaisa AU - Lindholm, Erik AU - Merioksa, Mikko AU - Pitkälä, Matti AU - Vahtola, Aija AU - Veikkolainen, Petteri CY - Helsinki DA - 2021/// PY - 2021 DP - Zotero SP - 8 LA - en PB - National Library of Finland UR - https://www.doria.fi/bitstream/handle/10024/180970/Collection%20plan%20for%20online%20materials%202021%E2%80%932024.pdf?sequence=1 Y2 - 2021/08/06/ L1 - https://www.doria.fi/bitstream/handle/10024/180970/Collection%20plan%20for%20online%20materials%202021%E2%80%932024.pdf?sequence=1 ER - TY - CONF TI - Research on Ontology and Linked Data-oriented Construction of Government Website Web Archive Thematic Knowledge Base AU - Huang, Xinping T2 - 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI) AB - This research aims to explore the application of semantic web technologies such as ontology and linked data in the preservation of government website information resources. Constructing a government website Web Archive thematic knowledge base driven by semantic web technology can improve historical traceability and data mining in the long-term preservation environment, as well as the utilization rate and service quality of original government website information, which has multiple utilization values such as serving as a reference for decision-making. On the basis of summarizing relevant research systems at home and abroad, this paper designs the thematic knowledge base of the government website Web Archive at three levels, namely requirement analysis of the knowledge base system, conceptual structure design, and logical structure design, drawing on the hierarchical idea of the OAIS model, and constructs a Web Archive thematic knowledge base model for government websites based on ontology and linked data.
Finally, the core process and key technology of the construction of the government website Web Archive thematic knowledge base are proposed, in order to provide a reference for research on the long-term preservation of government website information resources. C3 - 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI) DA - 2021/06// PY - 2021 DO - 10.1109/ICOEI51242.2021.9452984 DP - IEEE Xplore SP - 928 EP - 934 L2 - https://ieeexplore.ieee.org/abstract/document/9452984 KW - Ontologies KW - Web Archive KW - Semantic Web KW - Linked Data KW - Analytical models KW - Government KW - Government Website KW - Information services KW - Knowledge based systems KW - Ontology KW - Systematics KW - Thematic Knowledge Base ER - TY - JOUR TI - Collecting Pennsylvania Political Twitter Data AU - Dudash, Andrew M. AU - Russell, John E. T2 - Pennsylvania Libraries: Research & Practice AB - During the two most recent elections we have seen the importance of social media, and Twitter in particular, for political discourse. This paper describes the effort of an academic library to collect election-related Twitter data from Pennsylvania-specific organizational accounts and hashtags for 2018 and 2020 in the run-up and aftermath of both election cycles. Because of its importance to understanding contemporary politics and its historic value, libraries need to consider the opportunity to collect and make this data accessible to Pennsylvanians.
DA - 2021/06/29/ PY - 2021 DO - 10.5195/palrap.2021.249 DP - palrap.pitt.edu VL - 9 IS - 1 SP - 4 EP - 7 LA - en SN - 2324-7878 UR - http://palrap.pitt.edu/ojs/index.php/palrap/article/view/249 Y2 - 2021/07/15/09:17:42 L1 - http://palrap.pitt.edu/ojs/index.php/palrap/article/download/249/880 ER - TY - ELEC TI - Strategies for preserving memes as artefacts of digital culture AU - García López, Fátima AU - Martínez Cardama, Sara DA - 2020/// PY - 2020 UR - https://journals.sagepub.com/doi/full/10.1177/0961000619882070?casa_token=yFeCRUyk_0UAAAAA%3AE67U1Nv_kZo7XRuJitosk4D4dqCYqqtuJxlk3DvD_34wnceBkzGKyjGTUXl3dmnOXloDRFgqW5kc Y2 - 2021/07/15/09:18:10 L2 - https://journals.sagepub.com/doi/full/10.1177/0961000619882070?casa_token=yFeCRUyk_0UAAAAA%3AE67U1Nv_kZo7XRuJitosk4D4dqCYqqtuJxlk3DvD_34wnceBkzGKyjGTUXl3dmnOXloDRFgqW5kc ER - TY - JOUR TI - Polish Web resources described in the "Polish World" directory (1997). Characteristics of domains and their conservation state AU - Wilkowski, Marcin T2 - Archiwa - Kancelarie - Zbiory AB - For the purposes of this study, the print version of the Polish World directory by Martin Miszczak (Helion, 1997) was used to create an index of historical URLs and verify their current availability and presence in Web archives. The quantitative analysis of the index was prepared to obtain the rank data on top-level domains (TLDs) and subdomains, while the language of pages published in domains other than .PL was also examined. This study uncovered a low current availability (21.77 per cent) of Polish World URIs with a 79.6 per cent presence in Web archives (60.35 per cent for addresses unreachable today). Forty-six per cent of the addresses from the directory were available on domains other than .PL, of which only 15.36 per cent had content in Polish. It would seem that in 1997, Polish Internet users were able to use Polish-centric resources, mostly already available through the Polish country domain.
The 180 domain names with the .PL suffix uncovered during the study constitute around 20 per cent of .PL domain names active until at least the end of 1996 on the Web. DA - 2020/12/29/ PY - 2020 DP - apcz.umk.pl VL - 0 IS - 11(13) SP - 119 EP - 140 LA - en SN - 2544-5685 UR - https://apcz.umk.pl/czasopisma/index.php/AKZ/article/view/AKZ.2020.005 Y2 - 2021/07/15/09:19:55 L1 - https://apcz.umk.pl/czasopisma/index.php/AKZ/article/download/AKZ.2020.005/28213 L2 - https://apcz.umk.pl/czasopisma/index.php/AKZ/article/view/AKZ.2020.005/28213 ER - TY - JOUR TI - Keyphrase extraction and its applications to digital libraries AU - Patel, Krutarth Indubhai AB - Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers. Moreover, they are very useful in many applications such as document and citation recommendation, expert search, scientific paper summarization, collaborator recommendation, topic classification, and keyphrase extraction. Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. Furthermore, keyphrases associated with research papers provide an effective way to find useful information in the large and growing scholarly digital collections. Keyphrases are useful in many applications such as document indexing and summarization, topic tracking, contextual advertising, and opinion mining. However, keyphrases are not always provided with the papers, but they need to be extracted from their content. A growing number of scholarly digital libraries, museums, and archives around the world are embracing web archiving as a mechanism to collect born-digital material made available via the web. 
To create the specialized collection from the Web archived data, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection. In this dissertation, we first explore keyphrase extraction as a supervised task, formulated as sequence labeling, and utilize the power of Conditional Random Fields in capturing label dependencies through a transition parameter matrix consisting of the transition probabilities from one label to the neighboring label. Our proposed CRF-based supervised approach exploits word embeddings as features along with traditional, document-specific features. Our results on five datasets of research papers show that the word embeddings combined with document-specific features achieve high performance and outperform strong baselines for this task. We also propose KPRank, an unsupervised graph-based algorithm for keyphrase extraction that exploits both positional information and contextual word embeddings into a biased PageRank. Our experimental results on five benchmark datasets show that KPRank, which uses contextual word embeddings with an additional position signal, outperforms previous approaches and strong baselines for this task. Furthermore, we investigate and contrast three supervised keyphrase extraction models to explore their deployment in CiteSeerX digital library for extracting high-quality keyphrases. Further, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publicly-available research paper titles and author names are used as queries to a Web search engine. We were able to obtain ≈ 267,000 unique research papers through our fully-automated framework using ≈ 76,000 queries, resulting in almost 200,000 more papers than the number of queries. 
Furthermore, we propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library, including CiteSeerX, to crawl scientific documents. We use Self-Training in order to reduce the labeling effort and to utilize the unlabeled data to train an efficient researcher homepage classifier. Our experiments on a large-scale dataset highlight the effectiveness of our approach, and position Web search as an effective method for acquiring authors' homepages. Finally, we explore different learning models and feature representations to determine the best-performing ones for identifying the documents of interest from the web archived data. Specifically, we study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents. Moreover, we explore dynamic fusion models to find, on the fly, the model or combination of models that perform best on a variety of document types. We propose two dynamic classifier selection algorithms: Dynamic Classifier Selection for Document Classification (or DCSDC), and Dynamic Decision level Fusion for Document Classification (or DDFC). Our experimental results show that the approach that fuses different models outperforms individual models and other ensemble methods on all three datasets. 
DA - 2021/// PY - 2021 DP - krex.k-state.edu LA - en_US UR - https://krex.k-state.edu/dspace/handle/2097/41306 Y2 - 2021/07/15/09:20:16 L1 - https://krex.k-state.edu/dspace/bitstream/2097/41306/5/KrutarthPatel2021.pdf L2 - https://krex.k-state.edu/dspace/handle/2097/41306 ER - TY - JOUR TI - Perma.cc and Web Archival Dissonance with Copyright Law AU - Callister, Paul Douglas T2 - Legal Reference Services Quarterly AB - Harvard’s Perma.cc offers the solution to link rot—the phenomenon that citations in academic journals to Web materials disappear with the passage of time, resulting in “broken links” and disappearance of material from the Web. This article will describe Perma.cc and outline the kinds of copyright issues that may arise, including heavy use of copyright statutes and case law. It will examine the kind of preservation use of copyrighted materials, with reference to fair use, and the library prerogatives as exceptions to the exclusive rights of authors of materials found on the Web. This analysis includes detailed analysis of “transformative use” and the four factors of 17 U.S.C. § 107. It will consider the liability of Perma.cc and participating libraries and institutions under theories of contributory infringement and vicarious liability, including as modified by 17 U.S.C. § 512(c) and (d), governing takedown notices. The article concludes that Perma.cc’s archival use is neither firmly grounded in existing fair use nor library exemptions; that Perma.cc, its “registrar” library, institutional affiliates, and its contributors have some (at least theoretical) exposure to risk; and that current copyright doctrines and law do not adequately address Web archival storage for scholarly purposes. In doing so, it will question what the role of the scholarly Perma.cc citation ought to play—confirmation of scholarly propositions or preservation of and access to Web materials. 
The material and conclusions in this article are important for legal authors, law review editors, and librarians (especially those who use, support, or are considering partnering with Perma.cc) so that they might better assess copyright compliance, especially when selecting materials for archiving, such as articles from news sites, blogs, and professional and scholarly papers, articles, or books. DA - 2021/00/02/ PY - 2021 DO - 10.1080/0270319X.2021.1886785 DP - Taylor and Francis+NEJM VL - 40 IS - 1 SP - 1 EP - 57 SN - 0270-319X UR - https://doi.org/10.1080/0270319X.2021.1886785 Y2 - 2021/07/15/09:20:43 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=150872490&S=R&D=lxh&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprZSsqm4TLaWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA L2 - https://www.tandfonline.com/doi/abs/10.1080/0270319X.2021.1886785 KW - web archiving KW - link rot KW - copyright KW - broken links KW - fair use KW - contributory infringement KW - Contributory infringement (Copyright & trademark) KW - Fair use (Copyright) KW - library exceptions KW - Perma.cc KW - Scholarly periodicals KW - transformative use KW - vicarious liability ER - TY - JOUR TI - You say potato, I say potato Mapping Digital Preservation and Research Data Management Concepts towards Collective Curation and Preservation Strategies AU - Lindlar, Michelle AU - Rudnik, Pia AU - Jones, Sarah AU - Horton, Laurence T2 - International Journal of Digital Curation AB - This paper explores models, concepts and terminology used in the Research Data Management and Digital Preservation communities. In doing so we identify several overlaps and mutual concerns where the advancements of one professional field can apply to and assist another. By focusing on what unites rather than divides us, and by adopting a more holistic approach we advance towards collective curation and preservation strategies. 
DA - 2020/12/30/ PY - 2020 DO - 10.2218/ijdc.v15i1.728 DP - www.ijdc.net VL - 15 IS - 1 SP - 26 LA - en SN - 1746-8256 UR - http://www.ijdc.net/article/view/728 Y2 - 2021/07/15/09:22:35 L1 - http://www.ijdc.net/article/download/728/591 KW - digital preservation KW - preservation KW - digital curation KW - curation KW - DCC KW - IJDC KW - International Journal of Digital Curation ER - TY - JOUR TI - History’s Future in the Age of the Internet AU - Story, Daniel J. AU - Guldi, Jo AU - Hitchcock, Tim AU - Moravec, Michelle T2 - The American Historical Review AB - Ian Milligan’s History in the Age of Abundance? How the Web Is Transforming Historical Research (2019) presents and interrogates the challenges and opportunities that born-digital materials have for historians. Milligan argues that historians who wish to grapple with the archived internet need to think much more aggressively about engaging with digital methods and tools that can complement and extend the well-honed practices of close reading with approaches that can help analyze the vast and often unstructured archives of internet data. In this AHR Review Roundtable, three historians—Jo Guldi, Tim Hitchcock, and Michelle Moravec, all of whom incorporate digital approaches and concerns into their work—engage with a set of questions developed by Digital Scholarship Librarian Daniel J. Story, to discuss Milligan’s treatment of the digital archive of the web and its implications for historians’ work. Milligan offers a response to these insights and critiques, emphasizing the need for the historical discipline to change from within and build upon its valuable qualities. 
DA - 2020/10/21/ PY - 2020 DO - 10.1093/ahr/rhaa477 DP - Silverchair VL - 125 IS - 4 SP - 1337 EP - 1346 J2 - The American Historical Review SN - 0002-8762 UR - https://doi.org/10.1093/ahr/rhaa477 Y2 - 2021/07/15/09:23:50 L2 - https://academic.oup.com/ahr/article-abstract/125/4/1337/5933592 ER - TY - CHAP TI - Automatic Generation of Timelines for Past-Web Events AU - Campos, Ricardo AU - Pasquali, Arian AU - Jatowt, Adam AU - Mangaravite, Vítor AU - Jorge, Alípio Mário T2 - The Past Web: Exploring Web Archives A2 - Gomes, Daniel A2 - Demidova, Elena A2 - Winters, Jane A2 - Risse, Thomas AB - Despite significant advances in web archive infrastructures, the problem of exploring the historical heritage preserved by web archives is yet to be solved. Timeline generation emerges in this context as one possible solution for automatically producing summaries of news over time. Thanks to this, users can gain a better sense of reported news events, entities, stories or topics over time, such as getting a summary of the most important news about a politician, an organisation or a locality. Web archives play an important role here by providing access to a historical set of preserved information. This particular characteristic of web archives makes them an irreplaceable infrastructure and a valuable source of knowledge that contributes to the process of timeline generation. Accordingly, the authors of this chapter developed “Tell me Stories” (http://archive.tellmestories.pt), a news summarisation system, built on top of the infrastructure of Arquivo.pt—the Portuguese web-archive—to automatically generate a timeline summary of a given topic. In this chapter, we begin by providing a brief overview of the most relevant research conducted on the automatic generation of timelines for past-web events. Next, we describe the architecture and some use cases for “Tell me Stories”. Our system demonstrates how web archives can be used as infrastructures to develop innovative services. 
We conclude this chapter by enumerating open challenges in this field and possible future directions in the general area of temporal summarisation in web archives. CY - Cham DA - 2021/// PY - 2021 DP - Springer Link SP - 225 EP - 242 LA - en PB - Springer International Publishing SN - 978-3-030-63291-5 UR - https://doi.org/10.1007/978-3-030-63291-5_18 Y2 - 2021/07/15/09:24:26 ER - TY - JOUR TI - Collaborative collection development: current perspectives leading to future initiatives AU - Levenson, Helen N. AU - Nichols Hess, Amanda T2 - The Journal of Academic Librarianship AB - As academic libraries continue to face acquisition budget challenges, collaborative collection development (CCD) offers greater opportunities to fulfill the core role of library collecting and collection management, namely, to provide enhanced access to the widest variety of relevant resources in the most cost-responsible manner possible. Libraries have successfully implemented CCD projects of various types, and as a result, have achieved these needed cost savings. The authors conducted survey research to investigate current CCD activities and librarians' perceptions of its benefits, drawbacks, elements contributing to successful CCD programs, and possible obstacles to success. Library collections consist of a variety of material formats and librarians have applied CCD models to maintain needed access to these resources, shifting from ownership to access, all in support of building collective collections. The survey results found that, although challenges can exist, the application of CCD activities has realized substantial benefits, financial and otherwise, for academic libraries overall. 
DA - 2020/09/01/ PY - 2020 DO - 10.1016/j.acalib.2020.102201 DP - ScienceDirect VL - 46 IS - 5 SP - 102201 J2 - The Journal of Academic Librarianship LA - en SN - 0099-1333 ST - Collaborative collection development UR - https://www.sciencedirect.com/science/article/pii/S009913332030104X Y2 - 2021/07/15/09:24:41 L1 - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7366992/pdf/main.pdf L2 - https://www.sciencedirect.com/science/article/abs/pii/S009913332030104X KW - Collaborative collection development KW - Collection management KW - Collective collections KW - Cooperative collection development KW - Coordinated collection development KW - Survey research ER - TY - CHAP TI - Digital humanities preservation: A conversation for developing sustainable digital projects AU - Miller, A. AU - Taylor-Poleskey, Molly T2 - Transformative Digital Humanities AB - This chapter describes an urgent need to transform practice in digital humanities scholarship to include preservation at the forefront of digital project planning. Typical obstacles to an effective preservation plan include lack of time and funding, the structure of digital scholarship grants, lack of a culture of collaboration outside of the humanities, and uncertain ownership of projects after the end of implementation. Creating and teaching DH typically take precedence over thought about project preservation. This chapter argues instead that working with an interdisciplinary team on a preservation plan at the outset of a project will greatly increase the longevity and reproducibility of digital projects. Additionally, this chapter offers one way to spark the preservation conversation with an accompanying Preservation Plan template. Ideally, the principles of preservation will become an expected part of DH education to circumvent lost projects due to lack of technical, human, institutional, and financial support. 
DA - 2020/// PY - 2020 PB - Routledge SN - 978-0-429-39992-3 ST - Digital humanities preservation ER - TY - JOUR TI - Current research on theory and practice of digital libraries: best papers from TPDL 2017 AU - Tsakonas, Giannis AU - Kamps, Jaap T2 - International Journal on Digital Libraries AB - This volume presents a special issue on the 2017 edition of the Theory and Practice of Digital Libraries (TPDL) conference, held in Thessaloniki, Greece. We provide a brief overview of TPDL 2017 and introduce the selected papers that make up the rest of this volume. The papers cover different aspects of current digital library research, highlighting the important and multidisciplinary nature of the field. DA - 2020/03// PY - 2020 DO - 10.1007/s00799-020-00278-4 DP - DOI.org (Crossref) VL - 21 IS - 1 SP - 1 EP - 3 J2 - Int J Digit Libr LA - en SN - 1432-5012, 1432-1300 ST - Current research on theory and practice of digital libraries UR - http://link.springer.com/10.1007/s00799-020-00278-4 Y2 - 2021/07/15/09:26:02 L1 - https://e.humanities.uva.nl/publications/2020/tsak_curr20.pdf ER - TY - JOUR TI - How Can We Be Ready to Study History in the Age of Abundance? A Response AU - Milligan, Ian T2 - The American Historical Review AB - Ian Milligan’s History in the Age of Abundance? How the Web Is Transforming Historical Research (2019) presents and interrogates the challenges and opportunities that born-digital materials have for historians. Milligan argues that historians who wish to grapple with the archived internet need to think much more aggressively about engaging with digital methods and tools that can complement and extend the well-honed practices of close reading with approaches that can help analyze the vast and often unstructured archives of internet data. 
In this AHR Review Roundtable, three historians—Jo Guldi, Tim Hitchcock, and Michelle Moravec, all of whom incorporate digital approaches and concerns into their work—engage with a set of questions developed by Digital Scholarship Librarian Daniel J. Story, to discuss Milligan’s treatment of the digital archive of the web and its implications for historians’ work. Milligan offers a response to these insights and critiques, emphasizing the need for the historical discipline to change from within and build upon its valuable qualities. DA - 2020/10/21/ PY - 2020 DO - 10.1093/ahr/rhaa478 DP - Silverchair VL - 125 IS - 4 SP - 1347 EP - 1349 J2 - The American Historical Review SN - 0002-8762 ST - How Can We Be Ready to Study History in the Age of Abundance? UR - https://doi.org/10.1093/ahr/rhaa478 Y2 - 2021/07/15/09:26:19 L2 - https://academic.oup.com/ahr/article-abstract/125/4/1347/5933597 ER - TY - JOUR TI - An Exploratory Study of Advantages and Disadvantages of Website Preservation AU - Handisa, Rattahpinnusa Haresariu T2 - Record and Library Journal DA - 2021/06/29/ PY - 2021 DO - 10.20473/rlj.v7i1.113 DP - e-journal3.unair.ac.id VL - 7 IS - 1 SP - 1 EP - 6 LA - en SN - 2442-5168 UR - https://e-journal3.unair.ac.id/index.php/rlj/article/view/113 Y2 - 2021/07/15/09:26:47 L1 - https://e-journal3.unair.ac.id/index.php/rlj/article/download/113/59 KW - Accessible website ER - TY - JOUR TI - Comparison of Web Services for Sentiment Analysis in Social Networking Sites AU - Basmmi, Ain Balqis Md Nor AU - Halim, Shahliza Abd AU - Saadon, Nor Azizah T2 - IOP Conference Series: Materials Science and Engineering AB - With various type of web services available, it is hard to identify and compare which of the free access web services work best in analysing sentiment of extremist content in social networking sites. 
For that purpose, a generic approach, working with each web service's API through the PHP programming language, is used to test each dataset that was extracted based on the keyword ‘extremism’. Data from both Twitter and Facebook has been used as these two are the most powerful platforms for expressing one’s feelings. The web services are compared based on their accuracy, precision, recall and F-measures, and on obtaining the lowest mean square error (MSE) score. Four sentiment analysis web services are used, which are Sentiment Analyzer, Aylien, ParallelDots, and MonkeyLearn. From the comparison, MonkeyLearn obtained the best final results among all web services with the lowest MSE score of 14%. For the benefit of other researchers, the findings will reveal the suitable web service for analysing sentiment issues as critical as extremism. DA - 2020/07// PY - 2020 DO - 10.1088/1757-899X/884/1/012063 DP - Institute of Physics VL - 884 SP - 012063 J2 - IOP Conf. Ser.: Mater. Sci. Eng. LA - en SN - 1757-899X UR - https://doi.org/10.1088/1757-899x/884/1/012063 Y2 - 2021/07/15/09:27:21 L1 - https://iopscience.iop.org/article/10.1088/1757-899X/884/1/012063/pdf ER - TY - JOUR TI - Building NED: Open Access to Australia’s Digital Documentary Heritage AU - Lemon, Barbara AU - Blinco, Kerry AU - Somes, Brendan T2 - Publications AB - This article charts the development of Australia’s national edeposit service (NED), from concept to reality. A world-first collaboration between the national, state and territory libraries of Australia, NED was launched in 2019 and transformed our approach to legal deposits in Australia. NED is more than a repository, operating as a national online service for depositing, preserving and accessing Australian electronic publications, with benefits to publishers, libraries and the public alike. 
This article explains what makes NED unique in the context of global research repository infrastructure, outlining the ways in which NED member libraries worked to balance user needs with technological capacity and the variations within nine sets of legal deposit legislation. DA - 2020/06// PY - 2020 DO - 10.3390/publications8020019 DP - www.mdpi.com VL - 8 IS - 2 SP - 19 LA - en ST - Building NED UR - https://www.mdpi.com/2304-6775/8/2/19 Y2 - 2021/07/15/09:27:40 L1 - https://www.mdpi.com/2304-6775/8/2/19/pdf L2 - https://www.mdpi.com/2304-6775/8/2/19 KW - legal deposit KW - digital heritage KW - Australia KW - electronic publications KW - open repository ER - TY - JOUR TI - The Internet Archive AU - Gratzinger, Ollie T2 - American Journalism DA - 2021/04/03/ PY - 2021 DO - 10.1080/08821127.2021.1912531 DP - Taylor and Francis+NEJM VL - 38 IS - 2 SP - 249 EP - 251 SN - 0882-1127 UR - https://doi.org/10.1080/08821127.2021.1912531 Y2 - 2021/07/15/09:28:40 L2 - https://www.tandfonline.com/doi/abs/10.1080/08821127.2021.1912531?journalCode=uamj20 ER - TY - JOUR TI - The Iterative Design and Evaluation Approach for a Socially-aware Search and Retrieval Application for Digital Archiving AU - Spiliotopoulos, Dimitris AU - Frey, Dominik AU - Bouwmeester, Ruben AU - Welle, Deutsche AU - Kouroupetroglou, Georgios AU - Stavropoulou, Pepi AB - Designing user interfaces involves several iterations for usability design and evaluation as well as incremental functionality integration and testing. This paper reports on the methodological approach for the design and implementation of an application that is used for search and retrieval of socially-aware digital content. It presents the archivist view of professional media organizations and the specific requirements for successful retrieval of content. 
The content derived from the social media analysis is enormous and appropriate actions need to be taken to avoid irrelevant and/or repeated social information in the displayed results, as well as information overload. The archivist feedback reveals the way humans address the social information as presented in the form of metadata along with the archived raw content and how this drives the design of a dedicated search and retrieval application. DA - 2013/// PY - 2013 DP - Zotero SP - 5 LA - en L1 - https://d1wqtxts1xzle7.cloudfront.net/40052615/achi_2013_6_40_20345.pdf?1447661436=&response-content-disposition=inline%3B+filename%3DThe_Iterative_Design_and_Evaluation_Appr.pdf&Expires=1626344947&Signature=TqHbbPPVwprv8JZnx7iJvYlJwnJn13YkDYz-R5sYmICBovUlz8XDwbB8tWAcYXKdMY2-MDqGNcxfN-uT3VKdhaTuoh4pM9WolkA0Ikk~Uq3HOGdB6V-iFOivvvvpW-cFGo-MhO1HmKtAZebZzWlXM-YLQujP8xZybqTtGBq9~U8hPvBfsM~IDNkRIaB1ZKf3APVO2kQM9sqllUegaD2Ei~phCOfyIpA47e1Ed931xJKjBX3OmD62bSpp~IAMlz1Kv0~k0yEr0qSaYzYMwUel0gkn7uOklii0-5kDX~usICsrbsoHc9hrt4qDeYc2848MZWIvELJJE7hFGj3qUofMbA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA ER - TY - ELEC TI - Bootstrapping Web Archive Collections from Micro-Collections in Social Media 
DA - 2021/07/15/09:29:35 PY - 2021 LA - hu UR - https://www.proquest.com/openview/80407e8fdb55962153496efb7f9dee24/1?pq-origsite=gscholar&cbl=18750&diss=y Y2 - 2021/07/15/09:29:35 L2 - https://www.proquest.com/openview/80407e8fdb55962153496efb7f9dee24/1?pq-origsite=gscholar&cbl=18750&diss=y ER - TY - JOUR TI - Building community at distance: a datathon during COVID-19 AU - Fritz, Samantha AU - Milligan, Ian AU - Ruest, Nick AU - Lin, Jimmy T2 - Digital Library Perspectives AB - Purpose This paper aims to use the experience of an in-person event that was forced to go virtual in the wake of COVID-19 as an entryway into a discussion on the broader implications around transitioning events online. It gives both practical recommendation to event organizers as well as broader reflections on the role of digital libraries during the COVID-19 pandemic and beyond. Design/methodology/approach The authors draw on their personal experiences with the datathon, as well as a comprehensive review of literature. The authors provide a candid assessment of what approaches worked and which ones did not. Findings A series of best practices are provided, including factors for assessing whether an event can be run online; the mixture of synchronous versus asynchronous content; and important technical questions around delivery. Focusing on a detailed case study of the shift of the physical team-building exercise, the authors note how cloud-based platforms were able to successfully assemble teams and jumpstart online collaboration. The existing decision to use cloud-based infrastructure facilitated the event’s transition as well. The authors use these examples to provide some broader insights on meaningful content delivery during the COVID-19 pandemic. Originality/value Moving an event online during a novel pandemic is part of a broader shift within the digital libraries’ community. 
This paper thus provides a useful professional resource for others exploring this shift, as well as those exploring new program delivery in the post-pandemic period (both due to an emphasis on climate reduction as well as reduced travel budgets in a potential period of financial austerity). DA - 2020/01/01/ PY - 2020 DO - 10.1108/DLP-04-2020-0024 DP - Emerald Insight VL - 36 IS - 4 SP - 415 EP - 428 SN - 2059-5816 ST - Building community at distance UR - https://doi.org/10.1108/DLP-04-2020-0024 Y2 - 2021/07/15/09:30:58 L1 - https://www.emerald.com/insight/content/doi/10.1108/DLP-04-2020-0024/full/pdf?title=building-community-at-distance-a-datathon-during-covid-19 KW - Web archives KW - COVID-19 KW - Datathon KW - Interdisciplinary KW - Online events KW - Team formation ER - TY - SLIDE TI - TMVis: Visualizing Webpage Changes Over Time T2 - Web Archiving & Digital Libraries Workshop 2020, Virtual, August 5, 2020 A2 - Mabe, Abigail A2 - Patel, Dhruv A2 - Gunnam, Maheedhar A2 - Shankar, Surbhi A2 - Kelly, Mat A2 - Alam, Sawood A2 - Nelson, Michael L A2 - Weigle, Michele C AB - TMVis is a web service to provide visualizations of how individual webpages have changed over time. We leverage past research on summarizing collections of webpages with thumbnail-sized screenshots and on choosing a small number of representative archived webpages from a large collection. We offer four visualizations: Image Grid, Image Slider, Timeline, and Animated GIF. Embed codes for the Image Grid and Image Slider can be produced to include these visualizations on separate webpages. This tool can be used to allow scholars from various disciplines, as well as the general public, to explore the temporal nature of webpages. 
CY - Old Dominion University DA - 2020/// PY - 2020 LA - en UR - https://digitalcommons.odu.edu/computerscience_fac_pubs/174/ Y2 - 2021/08/06/ L1 - https://digitalcommons.odu.edu/cgi/viewcontent.cgi?article=1176&context=computerscience_fac_pubs ER - TY - JOUR TI - Webarchivierung in NRW aus Sicht der Universitäts- und Landesbibliothek Münster AU - Ammendola, Andrea Pietro AB - As one of the three state libraries in North Rhine-Westphalia (NRW), the ULB Münster has been mandated under the legal deposit law since 2013 to collect and archive not only physical media but also so-called non-physical media in the form of electronic publications on the web, including websites relevant to Westphalia. This is where the master's thesis comes in, examining how, under the given conditions in Münster and with which means, the legal mandate for web archiving can ideally and sustainably be implemented. A further question is which tasks the ULB can handle independently in this process and for which tasks cooperation may be useful. As results, conceptual recommendations were developed, both with regard to opening up the collection profile and with regard to a cooperative solution within the framework of an envisaged German Web Archive. DA - 2020/// PY - 2020 DP - publiscologne.th-koeln.de LA - en UR - https://publiscologne.th-koeln.de/frontdoor/index/index/docId/1621 Y2 - 2021/07/15/09:33:20 L1 - https://publiscologne.th-koeln.de/files/1621/MA_Ammendola_Andrea.pdf L2 - https://publiscologne.th-koeln.de/frontdoor/index/index/docId/1621 ER - TY - JOUR TI - Using Web Scraping In A Knowledge Environment To Build Ontologies Using Python And Scrapy AU - Asikri, M. El AU - Krit, S. AU - Chaib, H. T2 - European Journal of Molecular & Clinical Medicine AB - Web scraping, or web data extraction, is data scraping used for extracting data from websites. 
Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. In this paper, among other kinds of scraping, we focus on those techniques that extract the content of a Web page. In particular, we adopt scraping techniques in the Web e-commerce field. To this end, we propose a solution aimed at analyzing data extraction by exploiting Web scraping with Python and the Scrapy framework. DA - 2020/11/01/ PY - 2020 DP - ejmcm.com VL - 7 IS - 3 SP - 433 EP - 442 UR - https://ejmcm.com/article_1525.html Y2 - 2021/07/15/09:36:16 L1 - https://ejmcm.com/article_1525_ba229fbd0edcc979792f06aca3cd4f81.pdf L2 - https://ejmcm.com/article_1525.html ER - TY - JOUR TI - Copyright and Preservation of Born-digital Materials: Persistent Challenges and Selected Strategies AU - Fisher, Katherine T2 - The American Archivist AB - This article surveys and analyzes archival literature and legal resources (primarily United States–focused) related to copyright considerations that archivists and other content managers must be aware of to effectively and legally maintain a collection of born-digital materials. These considerations include the centrality of copying to preservation actions, shifting definitions of ownership, unclear distinctions between published and unpublished content, digital rights management laws and technologies, and the layered copyrights that can exist in complex digital objects and their dependencies. 
Strategies for dealing with these challenges include securing rights ahead of time, adopting legal rationales related to orphan works and fair use, adapting practices from specialized digital preservation subfields, ensuring routine procedures adequately address copyright-related recordkeeping and risk management, and advocating for preservation-enabling copyright reforms. An examination of these issues and strategies in the context of current thinking about copyright suggests that while certain legal exceptions and existing rights frameworks can help to facilitate digital preservation activities, copyright will continue to be a barrier until significant reforms are enacted. DA - 2021/03/08/ PY - 2021 DO - 10.17723/0360-9081-83.2.238 DP - Silverchair VL - 83 IS - 2 SP - 238 EP - 267 J2 - The American Archivist SN - 0360-9081 ST - Copyright and Preservation of Born-digital Materials UR - https://doi.org/10.17723/0360-9081-83.2.238 Y2 - 2021/07/15/09:37:23 L2 - https://meridian.allenpress.com/american-archivist/article-abstract/83/2/238/462517/Copyright-and-Preservation-of-Born-digital ER - TY - JOUR TI - Towards extracting event-centric collections from Web archives AU - Gossen, Gerhard AU - Risse, Thomas AU - Demidova, Elena T2 - International Journal on Digital Libraries AB - Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, currently there are no access methods that help Web archive users to efficiently access event-centric information in large-scale archives that go beyond the retrieval of individual disconnected documents. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate an efficient and intuitive access to information regarding past events. 
We address this problem by (1) enabling users to define event-centric document collections in an intuitive way through a Collection Specification; (2) developing a specialised extraction method that adapts focused crawling techniques to the Web archive setting; and (3) defining a function to judge the relevance of the archived documents with respect to the Collection Specification, taking into account the topical and temporal relevance of the documents. Our extended experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types. DA - 2020/03/01/ PY - 2020 DO - 10.1007/s00799-018-0258-6 DP - Springer Link VL - 21 IS - 1 SP - 31 EP - 45 J2 - Int J Digit Libr LA - en SN - 1432-1300 UR - https://doi.org/10.1007/s00799-018-0258-6 Y2 - 2021/07/15/09:38:37 L4 - http://link.springer.com/10.1007/s00799-018-0258-6 KW - Web archives KW - Focused crawling KW - Event-centric document collections ER - TY - JOUR TI - Innovation on the web: the end of the S-curve? AU - Priestley, Maria AU - Sluckin, T. J. AU - Tiropanis, Thanassis T2 - Internet Histories AB - Rigorous research into the historical past of Web technology-driven innovation becomes timely as technological growth and forecasting attract popular interest. Drawing on economic and management literature on the typical trends of technological innovation, we examine the long-term development of Web technology in a theoretically informed and empirical manner. An original longitudinal dataset of 20,493 Web-related US patents is used to trace the growth curve of Web technology from 1990 through 2013. We find that the accumulation of corporate Web inventions followed an S-shaped curve which shifted to linear growth after 2004.
This transition is unusual in relation to the traditional S-curve model of technological development that typically approaches a limit. The point of inflection on the S-curve coincided reasonably closely with the timing of the dot-com crash in year 2000. Moreover, we find a complex bi-directional relationship between patenting rates in Web technology and movements in the NASDAQ composite stock index. The implications of these results are discussed in theoretical and practical terms for sustained technological growth. Specific recommendations for different stakeholders in commercial Web development are included. DA - 2020/10/01/ PY - 2020 DO - 10.1080/24701475.2020.1747261 DP - Taylor and Francis+NEJM VL - 4 IS - 4 SP - 390 EP - 412 SN - 2470-1475 ST - Innovation on the web UR - https://doi.org/10.1080/24701475.2020.1747261 Y2 - 2021/07/15/09:45:07 L1 - https://www.tandfonline.com/doi/pdf/10.1080/24701475.2020.1747261 L2 - https://www.tandfonline.com/doi/full/10.1080/24701475.2020.1747261 KW - empirical measurement KW - innovation KW - patents KW - technological revolutions KW - Web technology ER - TY - JOUR TI - MementoEmbed and Raintale for Web Archive Storytelling AU - Jones, Shawn M. AU - Klein, Martin AU - Weigle, Michele C. AU - Nelson, Michael L. T2 - arXiv:2008.00137 [cs] AB - For traditional library collections, archivists can select a representative sample from a collection and display it in a featured physical or digital library space. Web archive collections may consist of thousands of archived pages, or mementos. How should an archivist display this sample to drive visitors to their collection? Search engines and social media platforms often represent web pages as cards consisting of text snippets, titles, and images. Web storytelling is a popular method for grouping these cards in order to summarize a topic. Unfortunately, social media platforms are not archive-aware and fail to consistently create a good experience for mementos. 
They also allow no UI alterations for their cards. Thus, we created MementoEmbed to generate cards for individual mementos and Raintale for creating entire stories that archivists can export to a variety of formats. DA - 2020/07/31/ PY - 2020 DP - arXiv.org UR - http://arxiv.org/abs/2008.00137 Y2 - 2021/07/15/09:45:33 L1 - https://arxiv.org/pdf/2008.00137.pdf L2 - https://arxiv.org/abs/2008.00137 KW - Computer Science - Digital Libraries KW - H.3.7 KW - Computer Science - Human-Computer Interaction KW - Computer Science - Information Retrieval KW - H.3.4 KW - H.3.6 ER - TY - JOUR TI - What Did It Look Like: A service for creating website timelapses using the Memento framework AU - Patel, Dhruv AU - Nwala, Alexander C. AU - Nelson, Michael L. AU - Weigle, Michele C. T2 - arXiv:2104.14041 [cs] AB - Popular web pages are archived frequently, which makes it difficult to visualize the progression of the site through the years at web archives. The What Did It Look Like (WDILL) Twitter bot shows web page transitions by creating a timelapse of a given website using one archived copy from each calendar year. Originally implemented in 2015, we recently added new features to WDILL, such as date range requests, diversified memento selection, updated visualizations, and sharing visualizations to Instagram. This would allow scholars and the general public to explore the temporal nature of web archives. 
DA - 2021/04/28/ PY - 2021 DP - arXiv.org ST - What Did It Look Like UR - http://arxiv.org/abs/2104.14041 Y2 - 2021/07/15/09:47:15 L1 - https://arxiv.org/pdf/2104.14041.pdf L2 - https://arxiv.org/abs/2104.14041 KW - Computer Science - Digital Libraries ER - TY - CHAP TI - Preservation of website records AU - Schenkolewski-Kroll, Silvia AU - Tractinsky, Assaf T2 - Trust and Records in an Open Digital Environment AB - This chapter examines the process of archival retention and disposition of an Israeli government website – the website of the Ministry of Foreign Affairs. It shows a system of methodologies and procedures for retention and disposition of records preserved on the website in a government cloud, in accordance with the regulations and guidelines included in the Israel Archives Law (1955), and discusses the improvements needed. The differences between the various records on the website are indicated and compared with administrative records according to their nature and presentation, contents, and use. For this purpose, the authors discuss the use of statistical analysis of visitors’ behaviour on the website in the process of appraisal. In light of these facts, a formula is presented that, among other elements, combines the number of returning visitors to a section of the website with other data. The purpose is to achieve appraisal metrics that allow different sections to be compared from the users’ perspective as well as geographically. The chapter shows the results of the research on the use of web analytics of users’ behaviour as a tool for appraisal of website records. Finally, the metadata necessary for appraisal and preservation of entire websites, as well as of their various sections, are recommended.
DA - 2020/// PY - 2020 PB - Routledge SN - 978-1-00-300511-7 ER - TY - CHAP TI - Web Scraping in Python Using Beautiful Soup Library AU - Patel, Jay M. T2 - Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale A2 - Patel, Jay M. AB - In this chapter, we’ll go through the basic building blocks of web pages such as HTML and CSS and demonstrate scraping structured information from them using popular Python libraries such as Beautiful Soup and lxml. Later, we’ll expand our knowledge and tackle issues that will make our scraper into a full-featured web crawler capable of fetching information from multiple web pages. CY - Berkeley, CA DA - 2020/// PY - 2020 DP - Springer Link SP - 31 EP - 84 LA - en PB - Apress SN - 978-1-4842-6576-5 UR - https://doi.org/10.1007/978-1-4842-6576-5_2 Y2 - 2021/07/15/09:50:50 ER - TY - BOOK TI - Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale AU - Patel, Jay M. CY - Berkeley, CA DA - 2020/// PY - 2020 DP - DOI.org (Crossref) LA - en PB - Apress SN - 978-1-4842-6575-8 978-1-4842-6576-5 ST - Getting Structured Data from the Internet UR - http://link.springer.com/10.1007/978-1-4842-6576-5 Y2 - 2021/07/15/09:51:31 ER - TY - BOOK TI - The Future of Digital Data, Heritage and Curation: in a More-than-Human World AU - Cameron, Fiona R. AB - The Future of Digital Data, Heritage and Curation critiques digital cultural heritage concepts and their application to data, developing new theories, curatorial practices and a more-than-human museology for a contemporary and future world. Presenting a diverse range of case examples from around the globe, Cameron offers a critical and philosophical reflection on the ways in which digital cultural heritage is currently framed as societal data worth passing on to future generations in two distinct forms: digitally born and digitizations. 
Demonstrating that most perceptions of digital cultural heritage are distinctly western in nature, the book also examines the complicity of such heritage in climate change, and environmental destruction and injustice. Going further still, the book theorizes the future of digital data, heritage, curation and the notion of the human in the context of the profusion of new types of societal data and production processes driven by the intensification of data economies and through the emergence of new technologies. In so doing, the book makes a case for the development of new types of heritage that comprise AI, automated systems, biological entities, infrastructures, minerals and chemicals – all of which have their own forms of agency, intelligence and cognition. The Future of Digital Data, Heritage and Curation is essential reading for academics and students engaged in the study of museums, archives, libraries, galleries, archaeology, cultural heritage management, information management, curatorial studies and digital humanities. CY - London DA - 2021/03/31/ PY - 2021 SP - 308 PB - Routledge SN - 978-1-00-314960-6 ST - The Future of Digital Data, Heritage and Curation ER - TY - CONF TI - Modeling Updates of Scholarly Webpages Using Archived Data AU - Jayawardana, Yasith AU - Nwala, Alexander C. AU - Jayawardena, Gavindya AU - Wu, Jian AU - Jayarathna, Sampath AU - Nelson, Michael L. AU - Lee Giles, C. T2 - 2020 IEEE International Conference on Big Data (Big Data) AB - The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors' homepages obtained from their Google Scholar profiles. 
We first obtain archived copies of these webpages from the Internet Archive (IA), and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate for the true update frequency in the short-term, and that our method provides better estimations of updates at a fraction of resources compared to the baseline models. Based on this, we demonstrate the utility of archived data to optimize the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions. C3 - 2020 IEEE International Conference on Big Data (Big Data) DA - 2020/12// PY - 2020 DO - 10.1109/BigData50022.2020.9377796 DP - IEEE Xplore SP - 1868 EP - 1877 L1 - https://arxiv.org/pdf/2012.03397 L2 - https://ieeexplore.ieee.org/abstract/document/9377796 KW - History KW - Internet KW - Search engines KW - Big Data KW - Crawl Scheduling KW - Data models KW - Frequency estimation KW - Portable document format KW - Search Engines KW - Web Crawling ER - TY - CONF TI - Social Cards Probably Provide For Better Understanding Of Web Archive Collections AU - Jones, Shawn M. AU - Weigle, Michele C. AU - Nelson, Michael L. T3 - CIKM '19 AB - Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search engine results and social media links are represented as surrogates, small easily digestible summaries of the underlying page. Search engines and social media have a different focus, and hence produce different surrogates than web archives. Search engine surrogates help a user answer the question "Will this link meet my information need?" 
Social media surrogates help a user decide "Should I click on this?" Our use case is subtly different. We hypothesize that groups of surrogates together are useful for summarizing a collection. We want to help users answer the question of "What does the underlying collection contain?" But which surrogate should we use? With Mechanical Turk participants, we evaluate six different surrogate types against each other. We find that the type of surrogate does not influence the time to complete the task we presented to the participants. Of particular interest are social cards, surrogates typically found on social media, and browser thumbnails, screen captures of web pages rendered in a browser. At p=0.0569 and p=0.0770, respectively, we find that social cards and social cards paired side-by-side with browser thumbnails probably provide better collection understanding than the surrogates currently used by the popular Archive-It web archiving platform. We measure user interactions with each surrogate and find that users interact with social cards less than other types. The results of this study have implications for our web archive summarization work, live web curation platforms, social media, and more. C1 - New York, NY, USA C3 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management DA - 2019/11/03/ PY - 2019 DO - 10.1145/3357384.3358039 DP - ACM Digital Library SP - 2023 EP - 2032 PB - Association for Computing Machinery SN - 978-1-4503-6976-3 UR - https://doi.org/10.1145/3357384.3358039 Y2 - 2021/07/15/ L1 - https://dl.acm.org/doi/pdf/10.1145/3357384.3358039 KW - web archives KW - user studies KW - collection summarization KW - mechanical turk KW - social cards KW - thumbnails KW - web archive collections KW - web page surrogates ER - TY - CONF TI - Using micro-collections in social media to generate seeds for web archive collections AU - Nwala, Alexander C. AU - Weigle, Michele C. AU - Nelson, Michael L.
T3 - JCDL '19 AB - In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high-quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but this ability comes at a cost: it is time-consuming to collect these seeds. Two main strategies adopted by curators for discovering seeds include scraping Web (e.g., Google) Search Engine Result Pages (SERPs) and social media (e.g., Twitter) SERPs. In this work, we studied three social media platforms in order to provide some insight into the characteristics of seeds generated from different sources. First, we developed a simple vocabulary for describing social media posts across different platforms. Second, we introduced a novel source for generating seeds from URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share posts about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. In this work, we call these posts micro-collections, whether shared on Reddit or Twitter, and we consider them an important source of seeds. This is because the effort taken to create micro-collections is an indication of editorial activity and a demonstration of domain expertise. Third, we generated 23,112 seed collections with text and hashtag queries from 449,347 social media posts from Reddit, Twitter, and Scoop.it. We collected in total 120,444 URIs from the conventional scraped SERP posts and micro-collections. We characterized the resultant seed collections across multiple dimensions including the distribution of URIs, precision, ages, diversity of webpages, etc.
We showed that seeds generated by scraping SERPs had a higher median probability (0.63) of producing relevant URIs than micro-collections (0.5). However, micro-collections were more likely to produce seeds with a higher precision than conventional SERP collections for Twitter collections generated with hashtags. Also, micro-collections were more likely to produce older webpages and more non-HTML documents. C1 - Champaign, Illinois C3 - Proceedings of the 18th Joint Conference on Digital Libraries DA - 2019/06/02/ PY - 2019 DO - 10.1109/JCDL.2019.00042 DP - ACM Digital Library SP - 251 EP - 260 PB - IEEE Press SN - 978-1-72811-547-4 UR - https://doi.org/10.1109/JCDL.2019.00042 Y2 - 2021/07/15/ L1 - https://dl.acm.org/doi/pdf/10.1109/JCDL.2019.00042 KW - web archiving KW - social media KW - collection building KW - crawling KW - seeds ER - TY - CONF TI - Revisionista.PT: Uncovering the News Cycle Using Web Archives AU - Martins, Flávio AU - Mourão, André A2 - Jose, Joemon M. A2 - Yilmaz, Emine A2 - Magalhães, João A2 - Castells, Pablo A2 - Ferro, Nicola A2 - Silva, Mário J. A2 - Martins, Flávio T3 - Lecture Notes in Computer Science AB - In this demo, we present a meta-journalistic tool that reveals post-publication changes in articles of Portuguese online news media. Revisionista.PT can uncover the news cycle of online media, offering a glimpse into an otherwise unknown dynamic edit history. We leverage on article snapshots periodically collected by Web archives to reconstruct an approximate timeline of the changes: additions, edits, and corrections. Revisionista.PT is currently tracking changes in about 140,000 articles published by 12 selected news sources and has a user-friendly interface that will be familiar to users of version control systems. In addition, an open source browser extension can be installed by users so that they can be alerted of changes to articles they may be reading. 
Initial work on this demo began as an entry in the Arquivo.PT 2019 Prize, where it received second place. C1 - Cham C3 - Advances in Information Retrieval DA - 2020/// PY - 2020 DO - 10.1007/978-3-030-45442-5_59 DP - Springer Link SP - 465 EP - 469 LA - en PB - Springer International Publishing SN - 978-3-030-45442-5 ST - Revisionista.PT L1 - https://link.springer.com/content/pdf/10.1007%2F978-3-030-45442-5_59.pdf ER - TY - ELEC TI - Archival strategies for contemporary collecting in a world of big data: Challenges and opportunities with curating the UK web archive DA - 2021/07/16/07:26:11 PY - 2021 ST - Archival strategies for contemporary collecting in a world of big data UR - https://www.proquest.com/docview/2546907231/B290DF7AC35D4B04PQ/11?accountid=15756 Y2 - 2021/07/16/07:26:11 L2 - https://www.proquest.com/docview/2546907231/B290DF7AC35D4B04PQ/11?accountid=15756 ER - TY - ELEC TI - Hachette Book Group v. Internet Archive: Is There a Better Way to Restore Balance in Copyright? DA - 2021/07/16/07:30:56 PY - 2021 ST - Hachette Book Group v. Internet Archive UR - https://www.proquest.com/docview/2497898976/B290DF7AC35D4B04PQ/12?accountid=15756 Y2 - 2021/07/16/07:30:56 L2 - https://www.proquest.com/docview/2497898976/B290DF7AC35D4B04PQ/12?accountid=15756 ER - TY - JOUR TI - Web Archive Analytics AU - Völske, Michael AU - Bevendorff, Janek AU - Kiesel, Johannes AU - Stein, Benno AU - Fröbe, Maik AU - Hagen, Matthias AU - Potthast, Martin T2 - arXiv:2107.00893 [cs] AB - Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes -- to the extent organizationally possible for researchers.
In order to better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data sets such as the public internet and its web pages, or what is preserved thereof by the Internet Archive. Recently, the Webis research group, a network of university chairs to which the authors belong, concluded an agreement with the Internet Archive to download a substantial part of its web archive for research purposes. The second part of this paper describes our infrastructure for processing this trove of data: We will eventually host around 8 PB of web archive data from the Internet Archive and Common Crawl, with the goal of supplementing existing large-scale web corpora and forming a non-biased subset of the 30 PB web archive at the Internet Archive. DA - 2021/// PY - 2021 DO - 10.18420/inf2020_05 DP - arXiv.org SN - 1617-5468 UR - http://arxiv.org/abs/2107.00893 Y2 - 2021/07/16/07:34:36 L1 - https://arxiv.org/pdf/2107.00893.pdf L2 - https://arxiv.org/abs/2107.00893 KW - Computer Science - Digital Libraries KW - Computer Science - Networking and Internet Architecture KW - Computer Science - Social and Information Networks ER - TY - JOUR TI - Introducing A Dark Web Archival Framework AU - Brunelle, Justin F. AU - Farley, Ryan AU - Atkins, Grant AU - Bostic, Trevor AU - Hendrix, Marites AU - Zebrowski, Zak T2 - arXiv:2107.04070 [cs] AB - We present a framework for web-scale archiving of the dark web. While commonly associated with illicit and illegal activity, the dark web provides a way to privately access web information. This is a valuable and socially beneficial tool to global citizens, such as those wishing to access information while under oppressive political regimes that work to limit information availability.
However, little institutional archiving is performed on the dark web (limited to the Archive.is dark web presence, a page-at-a-time archiver). We use surface web tools, techniques, and procedures (TTPs) and adapt them for archiving the dark web. We demonstrate the viability of our framework in a proof-of-concept and narrowly scoped prototype, implemented with the following lightly adapted open source tools: the Brozzler crawler for capture, WARC file for storage, and pywb for replay. Using these tools, we demonstrate the viability of modified surface web archiving TTPs for archiving the dark web. DA - 2021/07/08/ PY - 2021 DP - arXiv.org UR - http://arxiv.org/abs/2107.04070 Y2 - 2021/07/16/07:49:25 L1 - https://arxiv.org/pdf/2107.04070.pdf L2 - https://arxiv.org/abs/2107.04070 KW - Computer Science - Digital Libraries ER - TY - JOUR TI - From Print to Digital, from Document to Data: Digitalisation at the Publications Office of the European Union AU - Schafer, Valérie T2 - Open Information Science AB - Since the 1970s, the Publications Office of the European Union, the official publisher of all the institutions and bodies of the EU, has had to adapt to a fast-changing situation as the number of EU Member States has grown and the number and nature of publications has evolved (including publishing public tenders of EU institutions and Member States in 1978 through a supplement to the Official Journal of the European Union and handling CELEX, an interinstitutional and multilingual automated documentation system for community law, in 1992). These changes occurred over several ages of computing. The computerisation of the Publications Office was primarily a response to the need for rationalisation and productivity, but the aim was also to gradually adapt to new types of document publication and consultation. These different stages of digitalisation required the constant transfer of information to a multitude of media. 
Storage media such as punched cards, optical discs and CD-ROMs had varying life expectancies and are all evidence of attempts to digitise information before the Web. This evolution not only illustrates the need to constantly harmonise a large amount of information, it also highlights some continuities. It affects the management of information systems but also meets regularly updated standardisation, interoperability and sustainability needs within a complex ecosystem. DA - 2020/01/01/ PY - 2020 DO - 10.1515/opis-2020-0015 DP - www.degruyter.com VL - 4 IS - 1 SP - 203 EP - 216 LA - en SN - 2451-1781 ST - From Print to Digital, from Document to Data UR - https://www.degruyter.com/document/doi/10.1515/opis-2020-0015/html Y2 - 2021/07/16/08:21:05 L1 - https://www.degruyter.com/document/doi/10.1515/opis-2020-0015/pdf L2 - https://www.degruyter.com/document/doi/10.1515/opis-2020-0015/html ER - TY - BOOK TI - Forschungsdatenmanagement an der ETH-Bibliothek AU - Töwe, Matthias AB - Forschungsdatenmanagement an der ETH-Bibliothek was published in Bibliotheken der Schweiz: Innovation durch Kooperation on page 250. DA - 2018/06/11/ PY - 2018 DP - www.degruyter.com LA - de PB - De Gruyter Saur SN - 978-3-11-055379-6 UR - https://www.degruyter.com/document/doi/10.1515/9783110553796-015/html Y2 - 2021/07/16/08:22:45 L1 - https://www.degruyter.com/document/doi/10.1515/9783110553796-015/pdf L2 - https://www.degruyter.com/document/doi/10.1515/9783110553796-015/html ER - TY - JOUR TI - Digitálna knižnica v dobe vírovej II. – Vývoj a služba národného systému sprístupňovania diel AU - Vozár, Zdenko T2 - Digital Library in the era of coronavirus II - the development and service of the national system for accessing publications.
AB - This article describes the implementation, operation and development of emergency digital library systems throughout the Czech Republic during the Covid-19 pandemic, from 2020 to the present (2021/03), as an alternative information source, mainly for university students and the public R&D sector. It details the changes, evaluation and evolution of these operations brought about by the successive and unexpected development of the pandemic during 2020 and the early spring of 2021 – chiefly the launch of the National Digital Library portal, pressure on licensing policies and internal data migration, as well as the introduction of a continuous automatic web archiving campaign covering this momentous event. An emergency of this magnitude forced a short-term re-evaluation of licensing policies, but also furthered the agenda of adjusting the general strategy of libraries towards building and enriching online services, especially digital libraries and repositories. It is precisely access from the home office that makes otherwise inaccessible titles available to all students and registered readers. Moreover, this type of access allows instant and sustainable long-term cultural exchange at a time when the circulation of printed works is almost totally suspended. In principle, this new type of information circulation provided by digital library services should be attainable, but only in an environment of fair licensing agreements for all participants in the book market and in information transmission.
(English) DA - 2021/06// PY - 2021 DP - EBSCOhost VL - 32 IS - 1 SP - 43 EP - 58 J2 - Knihovna SN - 18013252 UR - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=151051032&lang=hu&site=ehost-live Y2 - 2021/07/16/08:47:14 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=151051032&S=R&D=lxh&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprZSsq24SLCWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - Národní digitální knihovna KW - National digital library KW - digital libraries KW - Česká republika KW - continuous covid crawl KW - Covid druhá vlna KW - Covid jaro 2020 KW - Covid second and third wave KW - Covid spring 2020 KW - covid-19 KW - Czech Republic KW - data migration KW - digitální knihovny KW - díla nedostupná na trhu KW - emergency licence KW - home-office KW - kontinuální sklizeň webů KW - Kramerius KW - licence KW - migrace KW - Moravian Library KW - Moravská zemská knihovna KW - NDK.cz KW - nouzový stav KW - online access KW - online přístup KW - out of commerce works KW - práva KW - rights KW - rozvoj software KW - state of emergency KW - studenti KW - students KW - SW development KW - system of electronic access KW - systém zpřístupnění KW - universities KW - univerzity KW - výjimečná licence ER - TY - JOUR TI - The anxious flâneur: Digital archiving and the Wayback Machine AU - Hartelius, E. Johanna T2 - Quarterly Journal of Speech AB - The Wayback Machine, the world's most extensive web archive, contains over 370 billion webpages dating to 1996. Yet despite its tagline, "Universal Access to All Knowledge," overwhelmed visitors report frustration and trouble with keyword searching and site navigation. This essay uses the Wayback Machine to demonstrate how access as a digital archival ideal is realized only to the extent that it is defined in terms of delivery and not disposition. 
Further, I submit that the combination of copious delivery and a weak structure, or lack of meaningful arrangement, is conducive to flânerie, a perusal movement through digital objects. Contra the cultural imaginary of the flâneur as a figure of pleasure, I suggest that practices of flânerie generate an experience of displacement and angst. With reference to Martin Heidegger's concept of the unheimlich (uncanny), I characterize being in a web archive as immersive but anxiously placeless. I then rely on Heidegger again to identify a productively dispositional role for rhetoric in digital archiving. As logos, rhetoric builds a structure, or "dwelling," that might provide orientation and make the archival unheimlich tolerable. The essay's implications pertain to the conditions of inhabitability in networked culture, which by design is functionally archival. DA - 2020/// PY - 2020 DO - 10.1080/00335630.2020.1828604 DP - EBSCOhost VL - 106 IS - 4 SP - 377 EP - 398 J2 - Quarterly Journal of Speech SN - 00335630 ST - The anxious flâneur UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=146791187&lang=hu&site=ehost-live Y2 - 2021/07/16/08:49:22 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=146791187&S=R&D=a9h&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprZSsq%2B4Sq%2BWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - WEB archives KW - Wayback Machine KW - archive KW - ACCESS to knowledge movement KW - ANXIETY KW - browsing KW - DISPLACEMENT (Psychology) KW - Flaneur KW - KEYWORD searching KW - UNCANNY, The (Psychoanalysis) KW - unheimlich KW - WAYBACK Machine (Web resource) KW - WEB browsing ER - TY - JOUR TI - This Account Doesn't Exist: Tweet Decay and the Politics of Deletion in the Brexit Debate AU - Bastos, Marco T2 - American Behavioral Scientist AB - Literature on influence operations has identified metrics that are indicative of social media manipulation, but few studies have explored the lifecycle of low-quality
information. We contribute to this literature by reconstructing nearly 3 million messages posted by 1 million users in the last days of the Brexit referendum campaign. While previous studies have found that on average only 4% of tweets disappear, we found that 33% of the tweets leading up to the referendum vote are no longer available. Only about half of the most active accounts that tweeted the referendum continue to operate publicly, and 20% of all accounts are no longer active. We tested whether partisan content was more likely to disappear and found more messages from the Leave campaign that disappeared than the entire universe of tweets affiliated with the Remain campaign. We compare these results with an assorted set of 45 hashtags posted in the same period and find that political campaigns present much higher ratios of user and tweet decay. These results are validated by inspecting 2 million Brexit-related tweets posted over a period of nearly 4 years. The article concludes with an overview of these findings and recommendations for future research. DA - 2021/05// PY - 2021 DO - 10.1177/0002764221989772 DP - EBSCOhost VL - 65 IS - 5 SP - 757 EP - 773 J2 - American Behavioral Scientist SN - 00027642 ST - This Account Doesn't Exist UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=149787344&lang=hu&site=ehost-live Y2 - 2021/07/16/08:51:23 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=149787344&S=R&D=a9h&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprZSs6e4SrCWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - Twitter KW - web archive KW - Brexit KW - BREXIT Referendum, 2016 KW - BRITISH withdrawal from the European Union, 2016-2020 KW - disinformation KW - manipulation KW - MICROBLOGS KW - misinformation KW - POLITICAL campaigns KW - PRACTICAL politics KW - SPINAL adjustment KW - TWITTER Inc. 
ER - TY - JOUR TI - Web Scraping in the Statistics and Data Science Curriculum: Challenges and Opportunities AU - Dogucu, Mine AU - Çetinkaya-Rundel, Mine T2 - Journal of Statistics Education AB - Best practices in statistics and data science courses include the use of real and relevant data as well as teaching the entire data science cycle starting with importing data. A rich source of real and current data is the web, where data are often presented and stored in a structure that needs some wrangling and transforming before they can be ready for analysis. The web is a resource students naturally turn to for finding data for data analysis projects, but without formal instruction on how to get that data into a structured format, they often resort to copy-pasting or manual entry into a spreadsheet, which are both time consuming and error-prone. Teaching web scraping provides an opportunity to bring such data into the curriculum in an effective and efficient way. In this article, we explain how web scraping works and how it can be implemented in a pedagogically sound and technically executable way at various levels of statistics and data science curricula. We provide classroom activities where we connect this modern computing technique with traditional statistical topics. Finally, we share the opportunities web scraping brings to the classrooms as well as the challenges to instructors and tips for avoiding them. 
DA - 2021/00/02/ PY - 2021 DO - 10.1080/10691898.2020.1787116 DP - EBSCOhost VL - 29 SP - S112 EP - S122 J2 - Journal of Statistics Education SN - 10691898 ST - Web Scraping in the Statistics and Data Science Curriculum UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=151115063&lang=hu&site=ehost-live Y2 - 2021/07/16/08:56:22 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=151115063&S=R&D=a9h&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprZSs6y4Sq%2BWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - Data science KW - BEST practices KW - CLASSROOM activities KW - Curriculum KW - CURRICULUM KW - DATA analysis KW - DATA science KW - R language KW - Teaching KW - Web scraping ER - TY - JOUR TI - Scraping the demos. Digitalization, web scraping and the democratic project AU - Ulbricht, Lena T2 - Democratization AB - Scientific, political and bureaucratic elites use epistemic practices like "big data analysis" and "web scraping" to create representations of the citizenry and to legitimize policymaking. I develop the concept of "demos scraping" for these practices of gaining information about citizens (the "demos") through automated analysis of digital trace data which are re-purposed for political means. This article critically engages with the discourse advocating demos scraping and provides a conceptual analysis of its democratic implications. It engages with the promise of demos scraping advocates to reduce the gap between political elites and citizens and highlights how demos scraping is presented as a superior form of accessing the "will of the people" and to increase democratic legitimacy. This leads me to critically discuss the implications of demos scraping for political representation and participation. In its current form, demos scraping is technocratic and de-politicizing; and the larger political and economic context in which it takes place makes it unlikely that it will reduce the gap between elites and citizens. 
From the analytic perspective of a post-democratic turn, demos scraping is an attempt by late modern, digitalized societies to address the democratic paradox of increasing citizen expectations coupled with a deep legitimation crisis. DA - 2020/04// PY - 2020 DO - 10.1080/13510347.2020.1714595 DP - EBSCOhost VL - 27 IS - 3 SP - 426 EP - 442 J2 - Democratization SN - 13510347 UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=141675697&lang=hu&site=ehost-live Y2 - 2021/07/16/08:57:20 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=141675697&S=R&D=a9h&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprdSs624SbeWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - big data KW - DATA mining KW - BIG data KW - data mining KW - GOVERNMENT policy KW - LEGITIMACY of governments KW - political participation KW - POLITICAL participation KW - political representation KW - post-democracy KW - public policy KW - responsiveness KW - technocracy ER - TY - JOUR TI - Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling AU - Thielmann, Anton AU - Weisser, Christoph AU - Krenz, Astrid AU - Säfken, Benjamin T2 - Journal of Applied Statistics AB - Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach also either requires human labelling of all of the data or it fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling.
Unsupervised one-class document classification with the integration of out-of-domain training data is achieved and >80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets. DA - 2021/04/28/ PY - 2021 DO - 10.1080/02664763.2021.1919063 DP - EBSCOhost SP - 1 EP - 18 J2 - Journal of Applied Statistics SN - 02664763 UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=150016117&lang=hu&site=ehost-live Y2 - 2021/07/16/08:58:31 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=150016117&S=R&D=a9h&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprZSs664SrOWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - machine learning KW - LDA topic model KW - one-class SVM KW - out-of-domain training data KW - Unsupervised document classification KW - web scraping ER - TY - JOUR TI - Medical informatics labor market analysis using web crawling, web scraping, and text mining AU - Schedlbauer, Jürgen AU - Raptis, Georgios AU - Ludwig, Bernd T2 - International Journal of Medical Informatics AB - Objectives: The European University Association (EUA) defines "employability" as a major goal of higher education. Therefore, competence-based orientation is an important aspect of education. The representation of a standardized job profile in the field of medical informatics, which is based on the most common labor market requirements, is fundamental for identifying and conveying the learning goals corresponding to these competences.Methods: To identify the most common requirements, we extracted 544 job advertisements from the German job portal, STEPSTONE. This process was conducted via a program we developed in R with the "rvest" library, utilizing web crawling, web extraction, and text mining. After removing duplicates and filtering for jobs that required a bachelor's degree, 147 job advertisements remained, from which we extracted qualification terms. 
We categorized the terms into six groups: professional expertise, soft skills, teamwork, processes, learning, and problem-solving abilities.Results: The results showed that only 45% of the terms are related to professional expertise, while 55% are related to soft skills. Studies of employee soft skills have shown similar results. The most prevalent terms were programming, experience, project, and server. Our second major finding is the importance of experience, further underlining how essential practical skills are.Conclusions: Previous studies used surveys and narrative descriptions. This is the first study to use web crawling, web extraction, and text mining. Our research shows that soft skills and specialist knowledge carry equal weight. The insights gained from this study may be of assistance in developing curricula for medical informatics. DA - 2021/06// PY - 2021 DO - 10.1016/j.ijmedinf.2021.104453 DP - EBSCOhost VL - 150 SP - N.PAG EP - N.PAG J2 - International Journal of Medical Informatics SN - 13865056 UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=150299043&lang=hu&site=ehost-live Y2 - 2021/07/16/08:59:53 KW - Text mining KW - Competence-based education KW - Graduate employability KW - Medical informatics KW - Soft skills ER - TY - JOUR TI - Web mining for innovation ecosystem mapping: a framework and a large-scale pilot study AU - Kinne, Jan AU - Axenbeck, Janna T2 - Scientometrics AB - Existing approaches to model innovation ecosystems have been mostly restricted to qualitative and small-scale levels or, when relying on traditional innovation indicators such as patents and questionnaire-based survey, suffered from a lack of timeliness, granularity, and coverage. Websites of firms are a particularly interesting data source for innovation research, as they are used for publishing information about potentially innovative products, services, and cooperation with other firms. 
Analyzing the textual and relational content on these websites and extracting innovation-related information from them has the potential to provide researchers and policy-makers with a cost-effective way to survey millions of businesses and gain insights into their innovation activity, their cooperation, and applied technologies. For this purpose, we propose a web mining framework for consistent and reproducible mapping of innovation ecosystems. In a large-scale pilot study we use a database with 2.4 million German firms to test our framework and explore firm websites as a data source. Thereby we put particular emphasis on the investigation of a potential bias when surveying innovation systems through firm websites if only certain firm types can be surveyed using our proposed approach. We find that the availability of a website and the characteristics of the website (number of subpages and hyperlinks, text volume, language used) differ according to firm size, age, location, and sector. We also find that patenting firms will be overrepresented in web mining studies. Web mining as a survey method also has to cope with extremely large and hyper-connected outlier websites and the fact that low broadband availability appears to prevent some firms from operating their own website and thus excludes them from web mining analysis. We then apply the proposed framework to map an exemplary innovation ecosystem of Berlin-based firms that are engaged in artificial intelligence. Finally, we outline several approaches for turning firm website content into valuable innovation indicators.
DA - 2020/00// PY - 2020 DO - 10.1007/s11192-020-03726-9 DP - EBSCOhost VL - 125 IS - 3 SP - 2011 EP - 2041 J2 - Scientometrics SN - 01389130 ST - Web mining for innovation ecosystem mapping UR - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=147251429&lang=hu&site=ehost-live Y2 - 2021/07/16/09:00:48 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=147251429&S=R&D=lxh&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprdSrqa4TLWWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - Pilot projects KW - Artificial intelligence KW - Web scraping KW - Berlin (Germany) KW - Hyperlinks KW - Innovation KW - Internet content KW - Internet surveys KW - Mining methodology KW - Web mining ER - TY - JOUR TI - Strategies to access web-enabled urban spatial data for socioeconomic research using R functions AU - Vallone, Andrés AU - Chasco, Coro AU - Sánchez, Beatriz T2 - Journal of Geographical Systems AB - Since the introduction of the World Wide Web in the 1990s, available information for research purposes has increased exponentially, leading to a significant proliferation of research based on web-enabled data. Nowadays the use of internet-enabled databases, obtained by either primary data online surveys or secondary official and non-official registers, is common. However, information disposal varies depending on data category and country and specifically, the collection of microdata at low geographical level for urban analysis can be a challenge. The most common difficulties when working with secondary web-enabled data can be grouped into two categories: accessibility and availability problems.
Accessibility problems are present when the data publication in the servers blocks or delays the download process, which becomes a tedious reiterative task that can produce errors in the construction of big databases. Availability problems usually arise when official agencies restrict access to the information for statistical confidentiality reasons. In order to overcome some of these problems, this paper presents different strategies based on URL parsing, PDF text extraction, and web scraping. A set of functions, which are available under a GPL-2 license, were built in an R package to specifically extract and organize databases at the municipality level (NUTS 5) in Spain for population, unemployment, vehicle fleet, and firm characteristics. DA - 2020/04// PY - 2020 DO - 10.1007/s10109-019-00309-y DP - EBSCOhost VL - 22 IS - 2 SP - 217 EP - 239 J2 - Journal of Geographical Systems SN - 14355930 UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=142293799&lang=hu&site=ehost-live Y2 - 2021/07/16/09:02:19 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=142293799&S=R&D=a9h&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprdSrqe4TLaWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - WORLD Wide Web KW - Spain KW - Web scraping KW - C81 KW - C88 KW - PDF (Computer file format) KW - PREFABRICATED buildings KW - R58 KW - SECONDARY analysis KW - SET functions KW - SPAIN KW - Spatial microdata KW - STATISTICS KW - URL parsing ER - TY - JOUR TI - Scraping the Web for Public Health Gains: Ethical Considerations from a 'Big Data' Research Project on HIV and Incarceration AU - Rennie, Stuart AU - Buchbinder, Mara AU - Juengst, Eric AU - Brinkley-Rubinstein, Lauren AU - Blue, Colleen AU - Rosen, David L T2 - Public Health Ethics AB - Web scraping involves using computer programs for automated extraction and organization of data from the Web for the purpose of further data analysis and use. 
It is frequently used by commercial companies, but also has become a valuable tool in epidemiological research and public health planning. In this paper, we explore ethical issues in a project that "scrapes" public websites of U.S. county jails as part of an effort to develop a comprehensive database (including individual-level jail incarcerations, court records and confidential HIV records) to enhance HIV surveillance and improve continuity of care for incarcerated populations. We argue that the well-known framework of Emanuel et al. (2000) provides only partial ethical guidance for the activities we describe, which lie at a complex intersection of public health research and public health practice. We suggest some ethical considerations from the ethics of public health practice to help fill gaps in this relatively unexplored area. DA - 2020/04// PY - 2020 DO - 10.1093/phe/phaa006 DP - EBSCOhost VL - 13 IS - 1 SP - 111 EP - 121 J2 - Public Health Ethics SN - 17549973 ST - Scraping the Web for Public Health Gains UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=144947375&lang=hu&site=ehost-live Y2 - 2021/07/16/09:05:31 L1 - http://content.ebscohost.com/ContentServer.asp?T=P&P=AN&K=144947375&S=R&D=a9h&EbscoContent=dGJyMMTo50Sep7Q4yOvqOLCmsEmeprdSrqq4TbaWxWXS&ContentCustomer=dGJyMPGrrkuurLFRuePfgeyx44Dt6fIA KW - BIG data KW - CONFIDENTIAL records KW - HEALTH planning KW - PUBLIC health ethics KW - PUBLIC health research ER - TY - JOUR TI - An analytical system for evaluating academia units based on metrics provided by academic social network AU - Wiechetek, Lukasz AU - Phusavat, Kongkiti AU - Pastuszak, Zbigniew T2 - Expert Systems with Applications AB - • Data from social networks can be used for researchers and research units evaluation. • Formal and natural sciences researchers more often use RG and have higher metrics. • Different types of researchers (position, field) shouldn't be directly compared. 
• Free software allows the development of analytical tools for fast evaluation of scientists. Social networks are becoming more and more popular, not only among young people looking for entertainment, but also among specialists, experts and researchers who wish to establish professional networks, develop business or research projects. They may also be useful for the comparison and evaluation of scientists and research organizations. This study aims to show how to build a framework of an analytical system for evaluation of researchers and research units using the data retrieved from an academic social network. Acquired data are used to find out the main differences between ResearchGate (RG) usage and values of metrics owned by scientists of different gender, scientific title and field of study, to find out if various groups of employees can be directly compared. The authors apply a web scraping technique for collecting data from the university web page (2847 employees) and use R scripts to acquire the metrics from the RG portal. Also, data on 1497 researchers and teaching workers from 11 faculties at a Polish university were explored. Descriptive statistics, the chi-square test, ANOVA and logistic regression were used to analyse the main RG metrics: RG Score, number of publications, reads and citations. The analysis shows significant differences both in the popularity of ResearchGate and in the values of its main metrics. The research confirmed that 1) the rvest package allows for fast data acquisition from RG, 2) RG metrics can be used by university managers to compare achievements and progress of single researchers, research labs, departments or faculties, and 3) researchers employed at faculties of formal and natural sciences use the RG portal more frequently and possess higher values of RG metrics; therefore, different types of workers and various branches of science shouldn't be compared directly.
DA - 2020/00/30/ PY - 2020 DO - 10.1016/j.eswa.2020.113608 DP - EBSCOhost VL - 159 SP - N.PAG EP - N.PAG J2 - Expert Systems with Applications SN - 09574174 UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=145756320&lang=hu&site=ehost-live Y2 - 2021/07/16/09:13:21 KW - WEBSITES KW - Web scraping KW - Academic social network KW - ACQUISITION of data KW - Analytical system KW - CHI-squared test KW - Comparative analysis KW - DESCRIPTIVE statistics KW - FREEWARE (Computer software) KW - ResearchGate KW - SOCIAL networks KW - University evaluation ER - TY - JOUR TI - Academic Social Networking Sites are Smaller, Denser Networks Conducive to Formal Identity Management, Whereas Academic Twitter is Larger, More Diffuse, and Affords More Space for Novel Connections AU - Goldstein, Scott T2 - Evidence Based Library & Information Practice AB - Objective - To examine the structure of academics' online social networks and how academics understand and interpret them. Design - Mixed methods consisting of network analysis and semi-structured interviews. Setting - Academics based in the United Kingdom. Subjects - 55 U.K.-based academics who use an academic social networking site and Twitter, of whom 18 were interviewed. Methods - For each subject, ego-networks were collected from Twitter and either ResearchGate or Academia.edu. Twitter data were collected primarily via the Twitter API, and the social networking site data were collected either manually or using a commercial web scraping program. Edge tables were created in Microsoft Excel spreadsheets and imported into Gephi for analysis and visualization. A purposive subsample of subjects was interviewed via Skype using a semi-structured format intended to further illuminate the network analysis findings.
Transcripts were deductively coded using a grounded theory-based approach. Main Results - Network analysis replicated earlier findings in the literature. A large number of academics have relatively few connections to others in the network, while a small number have relatively many connections. In terms of reciprocity (the proportion of mutual ties or pairings out of all possible pairings that could exist in the network), arts and humanities disciplines were significantly more reciprocal. Communities (measured using the modularity algorithm, which looks at the density of links within and between different subnetworks) are more frequently defined by institutions and research interests on academic social networking sites and by research interests and personal interests on Twitter. The overall picture was reinforced by the qualitative analysis. According to interview participants, academic social networking sites reflect pre-existing professional relationships and do not foreground social interaction, serving instead as a kind of virtual CV. By contrast, Twitter is analogized to a conference coffee break, where users can form new connections. 
DA - 2020/00// PY - 2020 DO - 10.18438/eblip29687 DP - EBSCOhost VL - 15 IS - 1 SP - 226 EP - 228 J2 - Evidence Based Library & Information Practice SN - 1715720X UR - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=142859999&lang=hu&site=ehost-live Y2 - 2021/07/16/09:14:25 KW - Social media KW - Online social networks KW - Academic librarians ER - TY - JOUR TI - Digital methods in a post-API environment AU - Perriam, Jessamy AU - Birkbak, Andreas AU - Freeman, Andy T2 - International Journal of Social Research Methodology AB - Qualitative and mixed methods digital social research often relies on gathering and storing social media data through the use of APIs (Application Programming Interfaces). In past years this has been relatively simple, with academic developers and researchers using APIs to access data and produce visualisations and analysis of social networks and issues. In recent years, API access has become increasingly restricted and regulated by corporations at the helm of social media networks. Facebook (the corporation) has restricted academic research access to Facebook (the social media platform) along with Instagram (a Facebook-owned social media platform). Instead, they have allowed access to sources where monetisation can easily occur, in particular, marketers and advertisers. This leaves academic researchers of digital social life in a difficult situation where API related research has been curtailed. In this paper we describe some rationales and methodologies for using APIs in social research. We then introduce some of the major events in academic API use that have led to the prohibitive situation researchers now find themselves in.
Finally, we discuss the methodological and ethical issues this produces for researchers and suggest some possible steps forward for API-related research. DA - 2020/05// PY - 2020 DO - 10.1080/13645579.2019.1682840 DP - EBSCOhost VL - 23 IS - 3 SP - 277 EP - 290 J2 - International Journal of Social Research Methodology SN - 13645579 UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=142313286&lang=hu&site=ehost-live Y2 - 2021/07/16/09:16:13 KW - Twitter KW - ethics KW - UNIVERSITY research KW - SOCIAL media KW - Facebook KW - APIs KW - web scraping KW - SOCIAL networks KW - Digital methods KW - Netvizz KW - SOCIAL media in business KW - SOCIAL media in education KW - SOCIAL network analysis KW - SOCIAL network theory KW - SOCIAL science research ER - TY - JOUR TI - iEcology: Harnessing Large Online Resources to Generate Ecological Insights AU - Jarić, Ivan AU - Correia, Ricardo A. AU - Brook, Barry W. AU - Buettel, Jessie C. AU - Courchamp, Franck AU - Di Minin, Enrico AU - Firth, Josh A. AU - Gaston, Kevin J. AU - Jepson, Paul AU - Kalinkat, Gregor AU - Ladle, Richard AU - Soriano-Redondo, Andrea AU - Souza, Allan T. AU - Roll, Uri T2 - Trends in Ecology & Evolution AB - Digital data are accumulating at unprecedented rates. These contain a lot of information about the natural world, some of which can be used to answer key ecological questions. Here, we introduce iEcology (i.e., internet ecology), an emerging research approach that uses diverse online data sources and methods to generate insights about species distribution over space and time, interactions and dynamics of organisms and their environment, and anthropogenic impacts. We review iEcology data sources and methods, and provide examples of potential research applications.
We also outline approaches to reduce potential biases and improve reliability and applicability. As technologies and expertise improve, and costs diminish, iEcology will become an increasingly important means to gain novel insights into the natural world. iEcology is a new research approach that seeks to quantify patterns and processes in the natural world using data accumulated in digital sources collected for other purposes. iEcology studies have provided new insights into species occurrences, traits, phenology, functional roles, behavior, and abiotic environmental features. iEcology is expanding, and will be able to provide valuable support for ongoing research efforts, as comparatively low-cost research based on freely available data. We expect that iEcology will experience rapid development over coming years and become one of the major research approaches in ecology, enhanced by emerging technologies such as automated content analysis, apps, internet of things, ecoacoustics, web scraping, and open source hardware. 
DA - 2020/07// PY - 2020 DO - 10.1016/j.tree.2020.03.003 DP - EBSCOhost VL - 35 IS - 7 SP - 630 EP - 639 J2 - Trends in Ecology & Evolution SN - 01695347 ST - iEcology UR - http://search.ebscohost.com/login.aspx?direct=true&db=a9h&AN=143557824&lang=hu&site=ehost-live Y2 - 2021/07/16/09:17:26 KW - TECHNOLOGICAL innovations KW - social media KW - digital data KW - internet KW - data mining KW - ARCHITECTURAL acoustics KW - biodiversity KW - biogeography KW - CONTENT analysis KW - culturomics KW - INTERNET of things KW - INTERNET privacy KW - PHENOLOGY KW - SPECIES distribution ER - TY - THES TI - MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing AU - Alam, Sawood AB - With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings as well as to support routing of requests in Memento aggregators. A memento is a past version of a web page and a Memento aggregator is a tool or service that aggregates mementos from many different web archives. To save resources, the Memento aggregator should only poll the archives that are likely to have a copy of the requested Uniform Resource Identifier (URI). Using the Crawler Index (CDX), we generate profiles of the archives that summarize their holdings and use them to inform routing of the Memento aggregator's URI requests. Additionally, we use fulltext search (when available) or sample URI lookups to build an understanding of an archive's holdings. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives).
This work explores strategies between these two extremes. For evaluation we used CDX files from Archive-It, UK Web Archive, Stanford Web Archive Portal, and Arquivo.pt. Moreover, we used web server access log files from the Internet Archive's Wayback Machine, UK Web Archive, Arquivo.pt, LANL's Memento Proxy, and ODU's MemGator Server. In addition, we utilized a historical dataset of URIs from DMOZ. In early experiments with various URI-based static profiling policies we successfully identified about 78% of the URIs that were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of the URIs with less than 10% relative cost, without any false negatives. In another experiment we found that we can correctly route 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile. We created MementoMap, a framework that allows web archives and third parties to express the holdings and/or voids of an archive of any size with varying levels of detail to fulfil various application needs. Our archive profiling framework enables tools and services to predict and rank archives where mementos of a requested URI are likely to be present. In static profiling policies we predefined the maximum depth of host and path segments of URIs for each policy that are used as URI keys. This gave us a good baseline for evaluation, but was not suitable for merging profiles with different policies. Later, we introduced a more flexible means to represent URI keys that uses wildcard characters to indicate whether a URI key was truncated. Moreover, we developed an algorithm to roll up URI keys dynamically at arbitrary depths when sufficient archiving activity is detected under certain URI prefixes.
In an experiment with dynamic profiling of archival holdings we found that a MementoMap of less than 1.5% relative cost can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive without any false negatives (i.e., 100% recall). In addition, we separately evaluated archival voids based on the most frequently accessed resources in the access log and found that we could have avoided more than 8% of the false positives without introducing any false negatives. We defined a routing score that can be used for Memento routing. Using a cut-off threshold technique on our routing score we achieved over 96% accuracy if we accept about 89% recall and for a recall of 99% we managed to get about 68% accuracy, which translates to about 72% saving in wasted lookup requests in our Memento aggregator. Moreover, when using top-k archives based on our routing score for routing and choosing only the topmost archive, we missed only about 8% of the sample URIs that are present in at least one archive, but when we selected top-2 archives, we missed less than 2% of these URIs. We also evaluated a machine learning-based routing approach, which resulted in an overall better accuracy, but poorer recall due to low prevalence of the sample lookup URI dataset in different web archives. We contributed various algorithms, such as a space and time efficient approach to ingest large lists of URIs to generate MementoMaps and a Random Searcher Model to discover samples of holdings of web archives. We contributed numerous tools to support various aspects of web archiving and replay, such as MemGator (a Memento aggregator), InterPlanetary Wayback (a novel archival replay system), Reconstructive (a client-side request rerouting ServiceWorker), and AccessLog Parser. Moreover, this work yielded a file format specification draft called Unified Key Value Store (UKVS) that we use for serialization and dissemination of MementoMaps.
It is a flexible and extensible file format that allows easy interactions with Unix text processing tools. UKVS can be used in many applications beyond MementoMaps. CY - Ann Arbor DA - 2020/// PY - 2020 SP - 251 LA - English M3 - Ph.D. PB - Old Dominion University UR - https://www.proquest.com/dissertations-theses/mementomap-web-archive-profiling-framework/docview/2478763660/se-2?accountid=15756 AN - 2478763660 DB - ProQuest One Academic L2 - http://linksource.ebsco.com/linking.aspx?sid=ProQuest+Dissertations+%26+Theses+Global&fmt=dissertation&genre=dissertations+%26+theses&issn=&volume=&issue=&date=2020-01-01&spage=&title=MementoMap%3A+A+Web+Archive+Profiling+Framework+for+Efficient+Memento+Routing&atitle=&au=Alam%2C+Sawood&isbn=9798557052580&jtitle=&btitle=&id=doi: KW - Web archiving KW - Information science KW - Memento KW - Computer science KW - 0984:Computer science KW - Query routing KW - Information technology KW - 0489:Information Technology KW - 0723:Information science KW - Memento routing KW - MementoMap KW - World wide web ER - TY - JOUR TI - Hachette Book Group v. Internet Archive: Is There a Better Way to Restore Balance in Copyright? AU - Schard, Robin T2 - Internet Reference Services Quarterly AB - Using the opening of the National Emergency Library as an opportunity, four large publishers, Hachette Book Group, HarperCollins Publishers, John Wiley & Sons, and Penguin Random House, filed suit against the Internet Archive claiming copyright infringement. This article discusses the lawsuit and the claims on both sides before discussing the weaknesses for the parties, and recommending that negotiation would be the best way to move forward. 
DA - 2021/// PY - 2021 DO - 10.1080/10875301.2021.1875100 VL - 24 IS - 1-2 SP - 53 EP - 58 LA - English SN - 1087-5301 UR - https://www.proquest.com/scholarly-journals/hachette-book-group-v-internet-archive-is-there/docview/2497898976/se-2?accountid=15756 AN - 2497898976 DB - SciTech Premium Collection L2 - http://linksource.ebsco.com/linking.aspx?sid=ProQ%3Acomputerinfo&fmt=journal&genre=article&issn=10875301&volume=24&issue=1-2&date=2021-01-01&spage=53&title=Internet+Reference+Services+Quarterly&atitle=Hachette+Book+Group+v.+Internet+Archive%3A+Is+There+a+Better+Way+to+Restore+Balance+in+Copyright%3F&au=Schard%2C+Robin&isbn=&jtitle=Internet+Reference+Services+Quarterly&btitle=&id=doi:10.1080%2F10875301.2021.1875100 KW - Web archiving KW - Library And Information Sciences KW - Archives & records KW - Internet KW - Copyright KW - Litigation KW - copyright KW - fair use KW - controlled digital lending KW - Hachette Book Group KW - Inc. v. Internet Archive KW - Infringement KW - National Emergency Library KW - Open Library KW - Publishing industry ER - TY - JOUR TI - Changes in Web Content in First 20 NIRF Ranking Institutes During 2010-19: an Analysis AU - Gangopadhyay, Subrata T2 - Library Philosophy and Practice AB - Web content is an important source for education and research. At present it is a mandatory requirement for higher learning institutes of India to present information on their institutional home page. Due to the dynamic nature of web content and the increasing use of emerging technology, the ways of presenting information on higher education websites have become complex. In this paper, we study the changes in web content during the last decade at the first 20 NIRF-ranked institutes. The Internet Archive's Wayback Machine has been used to get the website update dates and the content of archived web pages.
DA - 2020/05// PY - 2020 SP - 1 EP - 9 LA - English UR - https://www.proquest.com/scholarly-journals/changes-web-content-first-20-nirf-ranking/docview/2447005010/se-2?accountid=15756 AN - 2447005010 DB - ProQuest One Academic; Publicly Available Content Database L2 - http://linksource.ebsco.com/linking.aspx?sid=ProQ%3Alibraryscience&fmt=journal&genre=article&issn=&volume=&issue=&date=2020-05-01&spage=1&title=Library+Philosophy+and+Practice&atitle=Changes+in+Web+Content+in+First+20+NIRF+Ranking+Institutes+During+2010-19%3A+an+Analysis&au=Gangopadhyay%2C+Subrata&isbn=&jtitle=Library+Philosophy+and+Practice&btitle=&id=doi: KW - Web archiving KW - World Wide Web KW - Digital archives KW - Library And Information Sciences KW - Internet KW - National libraries KW - Web sites KW - Students KW - Higher education KW - India KW - Information dissemination ER - TY - JOUR TI - From tree to network: reordering an archival catalogue AU - Bell, Mark T2 - Records Management Journal AB - Purpose: This paper presents the results of a number of experiments performed at the National Archives, all related to the theme of linking collections of records. This paper aims to present a methodology for translating a hierarchy into a network structure using a number of methods for deriving statistical distributions from records metadata or content and then aggregating them. Simple similarity metrics are then used to compare and link collections of records with similar characteristics. Design/methodology/approach: The approach taken is to consider a record at any level of the catalogue hierarchy as a summary of its children. A distribution for each child record is created (e.g. word counts and date distribution) and averaged/summed with the other children. This process is repeated up the hierarchy to find a representative distribution of the whole series.
By doing this the authors can compare record series together and create a similarity network. Findings: The summarising method was found to be applicable not only to a hierarchical catalogue but also to web archive data, which is by nature stored in a hierarchical folder structure. The case studies raised many questions worthy of further exploration such as how to present distributions and uncertainty to users and how to compare methods, which produce similarity scores on different scales. Originality/value: Although the techniques used to create distributions, such as topic modelling and word frequency counts, are not new and have been used to compare documents, to the best of the authors' knowledge applying the averaging approach to the archival catalogue is new. This provides an interesting method for zooming in and out of a collection, creating networks at different levels of granularity according to user needs. DA - 2020/// PY - 2020 DO - 10.1108/RMJ-09-2019-0051 VL - 30 IS - 3 SP - 379 EP - 394 LA - English SN - 09565698 UR - https://www.proquest.com/scholarly-journals/tree-network-reordering-archival-catalogue/docview/2466656573/se-2?accountid=15756 AN - 2466656573 DB - ProQuest One Academic L2 - http://linksource.ebsco.com/linking.aspx?sid=ProQ%3Aabiglobal&fmt=journal&genre=article&issn=09565698&volume=30&issue=3&date=2020-09-01&spage=379&title=Records+Management+Journal&atitle=From+tree+to+network%3A+reordering+an+archival+catalogue&au=Bell%2C+Mark&isbn=&jtitle=Records+Management+Journal&btitle=&id=doi:10.1108%2FRMJ-09-2019-0051 KW - Web archiving KW - Archives KW - Digital archives KW - Case studies KW - Business And Economics--Management KW - Network analysis KW - Record linkage KW - Topic modelling ER - TY - JOUR TI - Getting acquainted with social networks and apps: capturing and archiving social media content AU - Anderson, Katie Elson T2 - Library Hi Tech News AB - The ephemeral nature of the content and perceived lack of permanency of the platforms led to questions
about the actual staying power of sites such as Facebook and Twitter. The important thing to remember is that while the platforms and apps may continue to thrive or be shuttered, created or forgotten, the underlying nature of connection, networking, data storage and content sharing is unlikely to change dramatically; only the platform, method and space may change. Libraries and librarians have been part of that consistent group of users, embracing the ability to post in a number of different formats, provide attribution and connect with communities (Power, 2014; Anderson, 2015). Looking beyond the personal risk of losing one’s teenage online past or the comments on an old blog post to the larger impact of social media and society, one can quickly see the importance of preserving and archiving large chunks of internet history represented on these social media platforms. DA - 2020/// PY - 2020 DO - 10.1108/LHTN-03-2019-0011 VL - 37 IS - 2 SP - 18 EP - 22 LA - English SN - 07419058 UR - https://www.proquest.com/trade-journals/getting-acquainted-with-social-networks-apps/docview/2499027983/se-2?accountid=15756 AN - 2499027983 DB - ProQuest One Academic; SciTech Premium Collection L1 - https://www.emerald.com/insight/content/doi/10.1108/LHTN-03-2019-0011/full/pdf?title=getting-acquainted-with-social-networks-and-apps-capturing-and-archiving-social-media-content L2 - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=142321484&lang=hu&site=ehost-live&scope=cite L2 - http://linksource.ebsco.com/linking.aspx?sid=ProQ%3Alibraryscience&fmt=journal&genre=article&issn=07419058&volume=37&issue=2&date=2020-02-10&spage=18&title=Library+Hi+Tech+News&atitle=Getting+acquainted+with+social+networks+and+apps%3A+capturing+and+archiving+social+media+content&au=Katie+Elson+Anderson&isbn=&jtitle=Library+Hi+Tech+News&btitle=&id=doi:10.1108%2FLHTN-03-2019-0011 KW - Web archiving KW - Web archives KW - Archiving KW - Digital archives KW - Digital media KW - Social media KW - Social
networks KW - Libraries KW - Librarians KW - Automation KW - Information professionals KW - Internet content KW - Data storage KW - History of archives KW - Library And Information Sciences--Computer Applications ER - TY - THES TI - A Framework for Verifying the Fixity of Archived Web Resources AU - Aturban, Mohamed AB - The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure that an archived resource has remained unaltered (i.e., fixed) since the time it was captured. Currently, end users do not have the ability to easily verify the fixity of content preserved in web archives. For instance, if a web page is archived in 1999 and replayed in 2019, how do we know that it has not been tampered with during those 20 years? In order for the users of web archives to verify that archived web resources have not been altered, they should have access to fixity information associated with these resources. However, most web archives do not allow accessing fixity information and, more importantly, even if fixity information is available, it is provided by the same archive delivering the resource, not by an independent archive or service. In this research, we present a framework for establishing and checking the fixity on the playback of archived resources, or mementos. The framework defines an archive-aware hashing function that consists of several guidelines for generating repeatable fixity information on the playback of mementos. These guidelines are the results of our 14-month study identifying and quantifying changes in replayed mementos over time that affect generating repeatable fixity information. Changes on the playback of mementos may be caused by JavaScript, transient errors, inconsistency in the availability of mementos over time, and archive-specific resources.
Changes are also caused by transformations in the content of archived resources applied by web archives to appropriately replay these resources in a user's browser. The study also shows that only 11.55% of mementos always produce the same fixity information after each replay, while about 16.06% of mementos always produce different fixity information after each replay. The remaining 72.39% of mementos produce multiple distinct fixity values. We also find that mementos may disappear when web archives move to different domains or archives. In addition to defining multiple guidelines for generating fixity information, the framework introduces two approaches, Atomic and Block, that can be used to disseminate fixity information to web archives. The main difference between the two approaches is that, in the Atomic approach, the fixity information of each archived web page is stored in a separate file before being disseminated to several on-demand web archives, while in the Block approach, we batch together fixity information of multiple archived pages into a single binary-searchable file before being disseminated to archives. The framework defines the structure of URLs used to publish fixity information on the web and retrieve archived fixity information from web archives. Our framework does not require changes in the current web archiving infrastructure, and it is built based on well-known web archiving standards, such as the Memento protocol. The proposed framework will allow users to generate fixity information on any archived page at any time, preserve the fixity information independently from the archive delivering the archived page, and verify the fixity of the archived page at any time in the future. CY - Ann Arbor DA - 2020/// PY - 2020 SP - 260 LA - English M3 - Ph.D.
PB - Old Dominion University UR - https://www.proquest.com/dissertations-theses/framework-verifying-fixity-archived-web-resources/docview/2451138951/se-2?accountid=15756 AN - 2451138951 DB - ProQuest One Academic L2 - http://linksource.ebsco.com/linking.aspx?sid=ProQuest+Dissertations+%26+Theses+Global&fmt=dissertation&genre=dissertations+%26+theses&issn=&volume=&issue=&date=2020-01-01&spage=&title=A+Framework+for+Verifying+the+Fixity+of+Archived+Web+Resources&atitle=&au=Aturban%2C+Mohamed&isbn=9798678108180&jtitle=&btitle=&id=doi: KW - Web archiving KW - Archives KW - Memento KW - Computer science KW - Library science KW - 0399:Library science KW - 0646:Web Studies KW - 0984:Computer science KW - Archived web pages KW - Framework KW - Verifying fixity KW - Web studies ER - TY - NEWS TI - Archiving the Pandemic: UTA Libraries project preserves community experiences with COVID-19 T2 - University Wire CY - Carlsbad DA - 2020/07/20/ PY - 2020 LA - English UR - https://www.proquest.com/wire-feeds/archiving-pandemic-uta-libraries-project/docview/2425490542/se-2?accountid=15756 AN - 2425490542 DB - ProQuest One Academic L2 - http://linksource.ebsco.com/linking.aspx?sid=ProQ%3Aeducation&fmt=journal&genre=article&issn=&volume=&issue=&date=2020-07-20&spage=&title=University+Wire&atitle=Archiving+the+Pandemic%3A+UTA+Libraries+project+preserves+community+experiences+with+COVID-19&au=&isbn=&jtitle=University+Wire&btitle=&id=doi: KW - Web archiving KW - Archives & records KW - Library collections KW - Web sites KW - Students KW - Archivists KW - COVID-19 KW - Coronaviruses KW - General Interest Periodicals--United States KW - Interviews KW - Mindfulness KW - Oral history KW - Pandemics KW - Quarantine ER - TY - JOUR TI - Appraisal Talk in Web Archives AU - Summers, Ed T2 - Archivaria AB - The Web is a vast and constantly changing information landscape that by its very nature seems to resist the idea of the archive. 
But for the last 20 years, archivists and technologists have worked together to build systems for doing just that. While technical infrastructures for performing web archiving have been well studied, surprisingly little is known about the interactions between archivists and these infrastructures. How do archivists decide what to archive from the Web? How do the tools for archiving the Web shape these decisions? This study analyzes a series of ethnographic interviews with web archivists to understand how their decisions about what to archive function as part of a community of practice. It uses critical discourse analysis to examine how the participants’ use of language enacts their appraisal decision-making processes. Findings suggest that the politics and positionality of the archive are reflected in the ways that archivists talk about their network of personal and organizational relationships. Appraisal decisions are expressive of the structural relationships of an archives as well as of the archivists’ identities, which form during mentoring relationships. Self-reflection acts as a key method for seeing the ways that interviewers and interviewees work together to construct the figured worlds of the web archive. These factors have implications for the ways archivists communicate with each other and interact with the communities that they document. The results help ground the encounter between archival practice and the architecture of the Web.
DA - 2020///Spring PY - 2020 VL - 89 SP - 70 EP - 103 LA - English SN - 03186954 UR - https://www.proquest.com/scholarly-journals/appraisal-talk-web-archives/docview/2518362480/se-2?accountid=15756 AN - 2518362480 DB - ProQuest One Academic L2 - http://linksource.ebsco.com/linking.aspx?sid=ProQ%3Alibraryscience&fmt=journal&genre=article&issn=03186954&volume=89&issue=&date=2020-04-01&spage=70&title=Archivaria&atitle=Appraisal+Talk+in+Web+Archives&au=Summers%2C+Ed&isbn=&jtitle=Archivaria&btitle=&id=doi: KW - Web archiving KW - Digital archives KW - Library And Information Sciences KW - Archivists KW - Infrastructure KW - Decision making ER - TY - JOUR TI - Az Országos Széchényi Könyvtár Webarchívumának 2020-as újdonságai. AU - Drótos, László T2 - New features of the National Széchényi Library’s web archive in 2020. AB - The paper describes the latest developments of the web archiving project launched in 2017 in the National Széchényi Library (NSZL) and the organizational, legal and infrastructural changes affecting the project. It also covers the results achieved in preserving different types of websites, archiving problems, and the software used. It summarizes the efforts to promote the topic, presents the international contact points, and finally lists the goals set for 2021 by the staff of the newly established Web Archiving Department in the NSZL. [ABSTRACT FROM AUTHOR] DA - 2021/// PY - 2021 DP - EBSCOhost IS - 1 SP - 31 EP - 38 J2 - Library Review / Konyvtari Figyelo SN - 00233773 UR - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=149857717&lang=hu&site=ehost-live DB - Library, Information Science & Technology Abstracts KW - Web archiving KW - Web archives KW - Internet KW - National libraries KW - Websites KW - Preservation KW - Development plan KW - Goal (Psychology) KW - Hungary KW - International cooperation KW - National library KW - Web development ER - TY - JOUR TI - Az OSZK webarchívumának újdonságai.
AU - Drótos, László T2 - Library Review / Konyvtari Figyelo DA - 2020/// PY - 2020 DP - EBSCOhost IS - 1 SP - 67 EP - 73 J2 - Library Review / Konyvtari Figyelo SN - 00233773 UR - http://search.ebscohost.com/login.aspx?direct=true&db=lxh&AN=143065215&lang=hu&site=ehost-live DB - Library, Information Science & Technology Abstracts ER - TY - JOUR TI - Együttműködési lehetőségek a webarchiválás területén. AU - Visky, Ákos László T2 - Opportunities for collaboration in the field of web archiving. AB - This is the published version of a presentation at the online workshop “404 Not Found – Who is to preserve the internet?”. It describes a collaboration model established within the framework of the Public Collection Digitization Strategy (KDS) between the Web Archive of the National Széchényi Library (NSZL) and some main county libraries. The advantages of collaboration and the co-ordination of workflows among the partners in the field of web archiving are presented. Some further professional discussions and presentations are also mentioned along with proposals for further research and training activities related to NSZL’s Web Archive. [ABSTRACT FROM AUTHOR] DA - 2021/// PY - 2021 DP - EBSCOhost VL - 67 IS - 1 SP - 39 EP - 45 J2 - Library Review / Konyvtari Figyelo SN - 00233773 UR - http://ojs.elte.hu/kf/article/view/2297 DB - Library, Information Science & Technology Abstracts KW - Web archiving KW - Web archives KW - Public libraries KW - Digitization KW - Internet KW - National libraries KW - Preservation KW - Development plan KW - Hungary KW - National library KW - Co-operation KW - National archives ER - TY - JOUR TI - Szubjektív tapasztalatok. Egy online konferencia hasznáról és káráról AU - Németh, Márton AU - Visky, Ákos László T2 - Könyv, Könyvtár, Könyvtáros DA - 2021/// PY - 2021 VL - 30 IS - 1 SP - 16 EP - 25
J2 - 3K UR - http://ojs.elte.hu/3k/article/view/1722 Y2 - 2021/08/04/ ER - TY - PCOMM TI - Exploring special web archives collections related to COVID-19: The case of the National Széchényi Library in Hungary AU - Németh, Márton A2 - Geeraert, Friedel DA - 2020/// PY - 2020 LA - English M3 - WARCnet Papers UR - https://cc.au.dk/fileadmin/user_upload/WARCnet/Geeraert_et_al_COVID-19_Hungary.pdf Y2 - 2021/08/04/ ER - TY - JOUR TI - Egyedi mentésekre szolgáló webarchiváló szoftverek AU - Drótos, László AU - Németh, Márton T2 - Könyv, Könyvtár, Könyvtáros DA - 2021/// PY - 2021 VL - 29 IS - 12 SP - 3 EP - 11 J2 - 3K UR - http://ojs.elte.hu/3k/article/view/1371 Y2 - 2021/08/04/ ER - TY - JOUR TI - Az idő fogságában. Ki őrzi meg a közösségi médiát? AU - Drótos, László T2 - Tudományos és Műszaki Tájékoztatás DA - 2021/// PY - 2021 VL - 68 IS - 7 SP - 428 EP - 439 J2 - TMT LA - Hungarian SN - 1586-2984 UR - https://tmt.omikk.bme.hu/tmt/article/view/13062 DB - TMT OJS Archívum Y2 - 2021/08/04/ ER - TY - CHAP TI - Web museum, web library, web archive: The responsibility of public collections to preserve digital culture AU - Drótos, László AU - Németh, Márton T2 - The Power of Reading: Proceedings of the XXVI Bobcatsss Symposium A2 - Petrovska, Lelde A2 - Īvāne-Kronberga, Baiba A2 - Meldere, Zane CY - Riga DA - 2018/// PY - 2018 SP - 124 EP - 126 LA - English PB - The University of Latvia Press SN - 978-9934-18-353-9 UR - http://bobcatsss2018.lu.lv/files/2018/08/BOBCATSSS_2018_TheProceedings.pdf Y2 - 2021/08/04/ ER - TY - JOUR TI - A webarchiválás nemzetközi környezete.
Mozaikok az IIPC 2019 kongresszusról AU - Németh, Márton T2 - Könyv, Könyvtár, Könyvtáros DA - 2018/// PY - 2018 VL - 27 IS - 12 SP - 23 EP - 27 J2 - 3K LA - Hungarian SN - 2732-0375 UR - http://epa.oszk.hu/01300/01367/00309/pdf/EPA01367_3K_2018_12_023-027.pdf ER - TY - JOUR TI - Az OSZK web-archiváló kísérleti (pilot) projektjének eredményei és egy üzemszerűen működő magyar webarchívum terve AU - Drótos, László AU - Moldován, István T2 - Könyvtári Figyelő DA - 2019/// PY - 2019 VL - 65 IS - 1 SP - 38 EP - 51 J2 - KF LA - Hungarian SN - 00233773 UR - http://epa.oszk.hu/00100/00143/00355/pdf/EPA00143_konyvtari_figyelo_2019_01_038-051.pdf Y2 - 2021/08/04/ ER - TY - CHAP TI - Using semantic microformats for web archiving – an initial project conception AU - Németh, Márton T2 - Nové trendy a východiská pri budovaní LTP archívov: zborník príspevkov zo 4. medzinárodnej konferencie o dlhodobej archivácii Bratislava, 5. 11. 2019 A2 - Tomková, Katarina CY - Bratislava DA - 2019/// PY - 2019 SP - 31 EP - 38 LA - English PB - Univerzitná knižnica v Bratislave SN - 978-80-89303-76-2 UR - https://cloud.ulib.sk/index.php/s/JITVgF9TmRruFtC Y2 - 2019/12/20/ ER - TY - BLOG TI - Webarchívum a nemzeti könyvtárban AU - Drótos, László T2 - OSZK blog DA - 2018/08/29/ PY - 2018 LA - Hungarian M3 - blog UR - https://nemzetikonyvtar.blog.hu/2018/08/29/mini_webarchivum_a_nemzeti_konyvtarban Y2 - 2021/08/04/ ER - TY - JOUR TI - Hogyan tudjuk fejleszteni a webes gyűjteményünket? A Holland Nemzeti Könyvtár webarchiválási tevékenységének értékelése (2007–2017) AU - Németh, Márton T2 - Könyvtári figyelő : külföldi lapszemle DA - 2018/// PY - 2018 VL - 64 IS - 4 SP - 434 J2 - KF LA - Hungarian SN - 0023-3773 UR - http://epa.oszk.hu/00100/00143/00354/pdf/EPA00143_konyvtari_figyelo_2018_04_623-672.pdf Y2 - 2021/08/04/ ER - TY - JOUR TI - 404 Not Found – Ki őrzi meg az internetet?
AU - Latorcai, Csaba T2 - Könyvtári Figyelő AB - The ever-growing volume of digital content and the general spread of internet use require the durable and secure preservation of the data created in the digital space, our digital past, for the purposes of scholarly processing, transmission to future generations, and use by society, emphasized the author, Csaba Latorcai, State Secretary for Public Administration of the Ministry of Human Capacities, in his presentation at the online workshop “404 Not Found – Who is to preserve the internet?” organized by the National Széchényi Library on 2 December 2020. The professional and secure preservation and utilisation of data created in the digital space requires an appropriate professional background, which the National Széchényi Library, as the library of the Hungarian nation, provides. In 2016 the Government adopted a resolution to provide the resources needed for the IT development of the National Széchényi Library. The preparatory work has been completed, the national library is ready to carry out web archiving tasks on a continuous basis, and a government resolution and a legislative amendment establish the legal and financial framework for web archiving from 1 January 2021. Involving the widest possible range of Hungarian libraries, and following and shaping international professional trends, the national library will realise the long-term preservation, availability for use, and scholarly processing of web content qualifying as hungaricum. DA - 2021/04/17/ PY - 2021 VL - 67 IS - 1 SP - 28 EP - 30 J2 - KF LA - Hungarian SN - 1586-5193 UR - http://ojs.elte.hu/kf/article/view/2295 Y2 - 2021/08/04/ ER -