Harvests of the national web space – MNMKK OSZK Webarchívum

In addition to selective (thematic or event-based) harvests, we try to take snapshot harvests of a representatively large part of the Hungarian web space once or twice a year. This means harvesting several hundred thousand websites and blogs from the start page down to a depth of two or three levels, excluding large files in order to save storage space. The initial URLs are collected from several sources: public lists of URLs from the Hungarian domain, links to Hungarian domains and subdomains found during earlier harvests, the .hu “zonefile” from the Internet Archive, and website addresses that have been selected for thematic collections or recommended through the corresponding form (these also include addresses outside the .hu domain).
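As an illustration of how seed URLs from several sources might be combined into one list, here is a minimal Python sketch. It is not the archive's actual tooling: the function names `normalize_seed` and `merge_seed_lists`, the example URLs, and the normalization rules (default `http` scheme, lowercased host, port and query string ignored) are all assumptions made for this example.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


def normalize_seed(url: str) -> Optional[str]:
    """Return a simplified canonical form of a seed URL, or None if unusable.

    Assumption for this sketch: bare hostnames default to http, and the
    port and query string are ignored.
    """
    url = url.strip()
    if not url:
        return None
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host:
        return None
    path = parts.path or "/"
    return f"{parts.scheme.lower()}://{host}{path}"


def merge_seed_lists(*sources: Iterable[str]) -> List[str]:
    """Merge several seed sources, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for source in sources:
        for raw in source:
            seed = normalize_seed(raw)
            if seed is not None and seed not in seen:
                seen.add(seed)
                merged.append(seed)
    return merged


# Hypothetical inputs standing in for a public URL list and a
# recommendation form; neither reflects real archive data.
public_list = ["http://Example.HU", " example.hu/ "]
recommended = ["https://blog.example.com/start", ""]
merged = merge_seed_lists(public_list, recommended)
print(merged)
```

Keeping the first occurrence of each normalized URL means that the order of the sources decides which spelling survives, which may or may not matter for a real crawl.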

The table below contains the main data of the completed web space harvests. The archived collections are stored in a closed archive in order to guarantee long-term preservation and to support future research. Detailed statistics for each harvest can be viewed on this page, and the web addresses given to the crawler as starting points can be searched here.

Start of harvest	End of harvest	Number of seed URLs	Number of downloaded URLs	Downloaded content	Stored new content (without compression)
2026-01-12	2026-01-30	942 323^*	129 894 558	9 583 GB	5 299 GB
2025-07-28	2025-08-21	942 323	119 833 675	7 691 GB	4 649 GB
2024-06-24	2024-07-23	865 982	129 728 757	8 788 GB	5 453 GB
2024-01-11	2024-02-03	1 371 617^*	138 409 426	8 766 GB	6 531 GB
2023-10-04	2023-10-31	992 303^**	126 850 047	7 830 GB	6 451 GB
2022-12-02	2022-12-20	1 371 617^*	158 416 570	8 261 GB	6 687 GB
2022-06-24	2022-07-20	1 371 617	174 282 398	8 891 GB	6 132 GB
2021-12-26	2022-01-03	433 863^*	69 356 724	5 372 GB	2 414 GB
2021-07-07	2021-07-12	433 863	71 878 955	5 495 GB	3 128 GB
2020-12-30	2021-01-04	251 230	47 881 581	4 140 GB	2 300 GB
2020-06-30	2020-07-05	269 430	46 380 598	3 608 GB	2 400 GB
2019-12-23	2020-01-02	246 819	110 367 190	6 387 GB	6 387 GB
2018-09-24	2018-09-28	291 078	172 639 350	10 287 GB	9 138 GB

^* The seed list is unchanged from the previous harvest.
^** The seed list is unchanged from the previous harvest, but mass-generated subdomains were not harvested for technical reasons.
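
The last two columns of the table can be compared to see what share of each harvest's downloaded volume was stored as new content. The following Python sketch computes that ratio for two rows copied from the table. The helper names `parse_gb` and `new_content_share` are invented for this example, and the table itself does not say why the two columns differ, so the ratio should be read only as a descriptive figure.

```python
def parse_gb(cell: str) -> int:
    """Convert a table cell such as '9 583 GB' to the integer 9583."""
    digits = cell.replace("GB", "").replace("\u00a0", "").replace(" ", "")
    return int(digits)


def new_content_share(downloaded: str, stored_new: str) -> float:
    """Return stored new content as a fraction of downloaded content."""
    return parse_gb(stored_new) / parse_gb(downloaded)


# (start of harvest, downloaded content, stored new content),
# taken from the first and the second-to-last rows of the table.
rows = [
    ("2026-01-12", "9 583 GB", "5 299 GB"),
    ("2019-12-23", "6 387 GB", "6 387 GB"),
]

for start, downloaded, stored in rows:
    share = new_content_share(downloaded, stored)
    print(f"{start}: {share:.1%}")
# prints:
# 2026-01-12: 55.3%
# 2019-12-23: 100.0%
```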