OSZKbot crawls only public websites, and usually not entire sites at maximum depth. It never crawls websites that require prior registration, and it cannot download database contents that are reachable only through a query form. Extremely large files, streaming media, and downloadable video content are also excluded. The robot operates in polite mode: it does not overload a web server with rapid, frequent requests, so it does not affect the server's latency for human users. Because of this, harvesting a larger website can sometimes take several days. Harvesting is repeated several times a year in order to capture new or modified files.
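The polite mode described above amounts to enforcing a minimum delay between consecutive requests to the same host. A minimal sketch of such a scheduler is shown below; the class name, the delay value, and the method names are illustrative assumptions, not OSZKbot's actual implementation.

```python
import time
from urllib.parse import urlparse

class PoliteScheduler:
    """Enforce a minimum delay between consecutive requests to the same host.

    This is a generic illustration of 'polite' crawling, not the real
    OSZKbot scheduler: before each fetch, the crawler calls wait_before(url)
    and sleeps if the previous request to that host was too recent.
    """

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.last_request = {}  # host -> monotonic timestamp of last request

    def wait_before(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)  # back off so the server is not flooded
        self.last_request[host] = time.monotonic()
```

With a two-second delay, a site of 100,000 pages already takes more than two days of wall-clock time to harvest, which is why larger websites can take several days to archive.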
Harvested web content becomes part of a non-public archive collection stored in WARC format. This archive is intended for research use, respecting privacy and copyright, on dedicated workstations within the library intranet, with no possibility of copying any of the content. Only the most important metadata (name, URL, topic) of the websites in this non-public collection will be publicly available, together with a 300-pixel thumbnail of each website's start page (at a quality too low to be readable).
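A WARC file is a sequence of records, each consisting of named header fields, a blank line, and a content block (typically the raw HTTP response). The sketch below assembles a single such record by hand to show the structure; the field set follows the WARC specification (ISO 28500), but the function itself is a simplified illustration, and real archives are written with dedicated tooling rather than code like this.

```python
import uuid
from datetime import datetime, timezone

def build_warc_record(target_uri, http_payload):
    """Assemble one WARC/1.0 'response' record as bytes.

    http_payload is the raw HTTP response (status line, headers, body).
    Content-Length counts only the content block, per the WARC spec.
    """
    body = http_payload if isinstance(http_payload, bytes) else http_payload.encode("utf-8")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(body)}",
    ]
    # Header block, blank line, content block, then the two CRLFs
    # that terminate every WARC record.
    return "\r\n".join(headers).encode("ascii") + b"\r\n\r\n" + body + b"\r\n\r\n"
```

Because each record keeps the full HTTP response together with its capture date and target URI, a WARC archive can later replay a website exactly as it looked at harvest time.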
For some websites, the content (copyright) owners will be asked to sign a contract so that the harvested content of these websites can be included in a public collection. This public demo collection primarily showcases the capabilities and limits of web archiving technology. Next to each archived version, a link points to the original URL of the archived website. Furthermore, the Google robot is excluded from the web archive, so no one can find the archived version of a website through Google instead of the original one. Website owners can request inclusion in the public archive by filling in a recommendation form.
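Excluding search engine crawlers from an archive site is conventionally done with the Robots Exclusion Protocol. A robots.txt file like the following, served from the archive's own domain, asks Googlebot (or every crawler) to stay out; this is a generic illustration of the mechanism, not necessarily the exact configuration the archive uses.

```
# Keep Google's crawler out of the archived copies
User-agent: Googlebot
Disallow: /

# Or, more broadly, exclude all compliant robots
User-agent: *
Disallow: /
```

As a result, archived pages never compete with the live originals in search results; readers reach the archive only through the library's own interface.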