FAQ | ARCHE

SUSTAINABILITY

Archive Data

You should definitely go to one of the established repositories for research data. They are made just for that. In Austria, there are already several institutional repositories available. Which one to use depends on the type of data you have, how it is to be used and also on your affiliation. You can search for a suitable repository on re3data.org.

Additionally, you should make sure to use file formats suitable for long-term preservation and provide sufficient documentation for your data (metadata) to enable others to understand your resources.

Finally, you should not only consider deposition of your data in a reliable repository but also open access for your data. Open Access is essential for reuse and thus longevity of data. Only visible and accessible data can be reused and thus be made more valuable. Many institutions already declared support of Open Access, including numerous Austrian institutions. The Open Definition provides further details and lists compatible licences. Forschungslizenzen.de gives a comprehensive overview of open and restrictive licences and provides guidance for choosing a proper licence. If you want to use one of the widespread Creative Commons (CC) licences you can use their tool to choose a licence.

Be sure to consult FAIR Data Principles to learn about recommended measures for discovery and reuse of data.

FAIR stands for Findable, Accessible, Interoperable and Reusable data and metadata. The principles, which were formulated by leading stakeholders in the field (representing academia, industry, funding agencies, and scholarly publishers), recommend and describe measures to foster discovery and reuse of data. The FAIR Data Principles have since also become part of official European recommendations (https://www.force11.org/group/fairgroup/fairprinciples).

As part of the CLARIAH-AT infrastructure, ARCHE is primarily intended to be a digital data hosting service for the humanities in Austria. Data from all humanities fields including modern languages, classical languages, linguistics, literature, history, jurisprudence, philosophy, archaeology, comparative religion, ethics, criticism and theory of the arts are equally welcome.

Detailed information is provided in the Collection Policy.

When in doubt, get in touch!

See our list of accepted and preferred formats for archiving.

Deposition and archiving involve work by the data provider and by ARCHE data curators. During the submission of digital resources to the repository, the data undergoes a curation process in order to ensure quality and consistency. We assist you in meeting the necessary requirements for sustainable resource archiving: data have to be provided with metadata and in preferred formats, persistent identifiers (PIDs) have to be assigned, IPR issues have to be resolved, and clear statements with regard to licensing and possible use of the resources need to be made.

Deposition involves four stages, which are detailed here:

Preparation steps for the depositor before the submission
The actual submission of the data
Checks on the data by the curators after we have received the data, which can result in the need to review and resubmit data
Actual archiving and publication of the data

In order to provide initial information about your resources for the ARCHE curators, a list of files is useful. You can create it by hand or automatically by using different tools. All main operating systems already provide this function, for example tree and dir on Windows and ls on Linux and Mac. Alternatively, you can install dedicated tools like DROID.
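If you would rather script the file list than copy the output of tree, dir or ls, a few lines of Python are enough. The following is a minimal sketch, not an official ARCHE tool: the function names and the CSV columns (path, size_bytes) are assumptions chosen for illustration, so check with the curators which information they actually need.

```python
import csv
import os


def list_files(root):
    """Return (relative path, size in bytes) for every file below root, sorted by path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use forward slashes so the listing looks the same on every operating system.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, os.path.getsize(full)))
    return sorted(entries)


def write_listing(root, out):
    """Write the file list as CSV (hypothetical columns: path, size_bytes) to a text stream."""
    writer = csv.writer(out)
    writer.writerow(["path", "size_bytes"])
    writer.writerows(list_files(root))


# Example use (the folder name is a placeholder):
# with open("filelist.csv", "w", newline="", encoding="utf-8") as out:
#     write_listing("my_data_folder", out)
```

The resulting CSV file can be opened in any spreadsheet program and sent along with your first enquiry.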

Providing a licence for your data makes it reusable and clearly describes the rights you give potential re-users of your data.

You should consider open access for your data, which is essential for reuse and the longevity of your data. Only visible and accessible data can be reused and thus made more valuable. Many institutions already declared support of Open Access, including numerous Austrian institutions. The Open Definition provides further details and lists licences that comply. Forschungslizenzen.de gives a comprehensive overview of open and restrictive licences and provides guidance for choosing a proper licence. If you want to use one of the widespread Creative Commons (CC) licences you can use their tool to choose a licence.

We suggest the use of CC-BY (CC - Attribution) or CC-BY-SA (CC - Attribution-ShareAlike). When depositing software consider using specific software licences like BSD or GPL. You can use the License Selector tool to select an appropriate licence for either software or data.

ARCHE makes use of the Handle System to assign unique and persistent identifiers to the digital objects. In such a manner, every resource has a uniquely identifiable URL that will always point to the same data, wherever it might physically move in the future. The handle is especially meant for citing the resources in publications. With additional information about creators and contributors ARCHE generates a suggested citation that is displayed along with each resource.

PID stands for persistent identifier and is a unique string that is persistently assigned to a digital object. It is comparable to the concept of ISBN numbers assigned to print publications in order to identify them. A PID helps in identifying and referencing an object in a stable manner, regardless of the actual storage location. Examples for PID systems are URNs, DOIs or Handles.
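To illustrate how a Handle is turned into a citable link: a handle consists of a prefix and a suffix separated by a slash, and the public Handle.Net proxy at https://hdl.handle.net/ resolves it to the current location. The sketch below only builds the URL. The function name is an assumption, and the handle used in the usage comment is a made-up placeholder, not a real ARCHE identifier.

```python
from urllib.parse import quote

HANDLE_PROXY = "https://hdl.handle.net/"  # public Handle.Net resolver


def handle_to_url(handle):
    """Turn a handle of the form 'prefix/suffix' into a resolvable URL."""
    handle = handle.strip()
    # Accept the common notations 'hdl:prefix/suffix' and full proxy URLs as well.
    for scheme in ("hdl:", "https://hdl.handle.net/", "http://hdl.handle.net/"):
        if handle.lower().startswith(scheme):
            handle = handle[len(scheme):]
            break
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle: {handle!r}")
    # Percent-encode characters that are not safe in a URL path.
    return HANDLE_PROXY + quote(prefix, safe=".") + "/" + quote(suffix, safe="/-._~")


# Usage with a hypothetical handle:
# handle_to_url("hdl:12345/example-1")
```

Because the proxy looks up the current location on every request, the URL stays valid even if the data moves to another server.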

Every change to the resources and metadata is stored as a new version. If the changes are substantial or if two versions of the data need to be available, a new object with a new PID should be created. The new object should include a link to the preceding version, which retains its PID.

ARCHE runs on the systems maintained by the Computing Centre of the Austrian Academy of Sciences (ARZ), which makes for a solid organisational and technical backing. To avoid data loss of archived data due to deterioration of physical storage, malicious threats, or other emergencies, redundancy is the key for the preservation of data. Regular backups help us to protect and restore data.

Backups of the data in the repository are performed regularly: a daily copy is stored and replicated within the internal ARZ NetApp setup on-site. In addition, the data is replicated off-site to the long-term storage in the computing centre run by Max Planck Computing and Data Facility (MPCDF) in Garching, Germany. Checks of the integrity of the copies are performed regularly. We keep at least three copies at all times, one of them off-site.

Further details are described in our Storage Procedures.
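Integrity checks on copies are commonly done by comparing checksums. The following sketch shows the general idea with SHA-256. It is an illustration only and makes no claim about the procedure or algorithm actually used at ARZ or MPCDF; the function names are assumptions. Depositors can use the same approach to verify their own copies before and after a transfer.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """Return True if all given files have the same checksum (False for an empty list)."""
    checksums = {sha256_of(path) for path in paths}
    return len(checksums) == 1
```

If a single bit differs in one of the copies, its checksum changes and copies_match returns False.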

Yes, if needed. However, ARCHE will retain administrative metadata indicating that the data itself was removed. The assigned PID will also be kept and point to a tombstone page displaying the metadata.

In accordance with the advocacy of the research infrastructures and the general development with respect to Open Access, we strongly encourage the data producers to be as open as possible: publicly available data has a better chance to be picked up by fellow colleagues which is good for the reputation and the citation index. Public funding agencies increasingly require researchers to publish not only the results of their research, but also the research data.

However we are aware that the Open Access approach is not possible in all cases. IPR or ethical issues as well as strategic considerations may require more restrictive access modes. We will help you to select the right licence for your needs. If necessary, we also offer the possibility to just archive the data, without any public access.

The deposition and storage itself is free of charge. The repository is run as part of the research infrastructure as a service to the community. If the data requires further processing and extensive curation we might charge for the curation effort.


AVAILABILITY

Search and Use Data

The resources are published on ARCHE’s web site and can be browsed through the web interface.

Furthermore, metadata about the resources are offered for harvesting via OAI-PMH, allowing dissemination via additional channels, such as the Virtual Language Observatory, CLARIN’s central metadata catalogue.
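For readers unfamiliar with OAI-PMH: a harvester sends plain HTTP requests with a verb parameter and receives XML. The sketch below builds a ListIdentifiers request and extracts the record identifiers from a response. The base URL and the sample response are placeholders invented for illustration (ARCHE's actual endpoint is not given here); only the verb, the metadataPrefix parameter, the oai_dc format and the XML namespace come from the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Build the request URL for the OAI-PMH verb ListIdentifiers."""
    query = urlencode({"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_identifiers(xml_text):
    """Extract the record identifiers from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]


# Invented sample response, shortened to the parts used above.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier><datestamp>2020-01-01</datestamp></header>
    <header><identifier>oai:example.org:2</identifier><datestamp>2020-01-02</datestamp></header>
  </ListIdentifiers>
</OAI-PMH>"""

# Usage with a placeholder endpoint:
# list_identifiers_url("https://example.org/oai")
# parse_identifiers(SAMPLE_RESPONSE)
```

A real harvester would additionally follow the resumptionToken element that the protocol uses to page through long result lists.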

In general, the Terms of Use apply to the use of the resources and services provided by ARCHE. Additionally, resource-specific licences apply as stated in the description of every resource.

No. All the resources are available free of charge.

It depends. There are three basic modes of access: public, academic and restricted.

Public resources are accessible without any further restrictions. Academic access means that you have to be affiliated with an academic institution (e.g. be a member of a university). This is checked primarily via the so-called Federated (or Shibboleth) Login (see next question). If you cannot login via Shibboleth, but are in fact affiliated with an academic institution and require the resource for academic purposes, please contact us.

Some of the resources are only available on the basis of a special agreement. This is indicated by the “restricted” access mode which usually implies that you have to fill in a registration form and accept a special licence. In the worst case the resource is not available online at all. In this case, you need to contact us to find out how to get access to the resource.

Shibboleth, AAI (Authentication and Authorisation Infrastructure), or SSO (Single Sign-On) refer to an architecture where service providers rely on identity providers to authenticate users. That is, if users want to use a service (like ARCHE) for which they need to authenticate, they are redirected to their home institution (e.g. their university), where they can log in with their institutional credentials. If the login is successful, the home institution lets the service provider know that they are entitled to use the service. In short, you can log in to different services with your institutional account without the need to register separately every time.

This is similar to the OpenID initiative known in the “commercial” world (logging in to a web page with your Google or Facebook account).

Given that this “Identity Federation” is established by academic institutions, it is implicitly assumed that a user who can log in via Shibboleth is affiliated with an academic institution.

The Open Archival Information System (OAIS) is a reference model developed by the Consultative Committee for Space Data Systems (CCSDS) and consists of a set of recommendations for archival systems dedicated to long-term preservation and maintenance of digital information.

The OAIS model describes six functional entities between which information packages are exchanged. These information packages contain either the original submitted information (Submission Information Package, SIP), the information prepared for archiving (Archival Information Package, AIP), or the information ready for dissemination (Dissemination Information Package, DIP).

More information can be found in publications by CCSDS, as for example in the Magenta Book.

SIP stands for Submission Information Package and represents the information package that is delivered to ARCHE for ingestion and archiving. The SIP contains the data to be stored and all necessary metadata about the package and its content.

When submitting a SIP, please make sure that the data is provided in formats suitable for long-term preservation and that sufficient metadata accompanies the package.

AIP stands for Archival Information Package. It contains the metadata and the data submitted via the SIP, information about preservation and other documentation accumulated during the ingestion process. Data from the SIP might have to undergo file conversions to produce an AIP with data suitable for long-term preservation.

DIP stands for Dissemination Information Package. A DIP can be derived from one or multiple AIPs and is used to present the data and metadata to the consumer. The content of a DIP is presented in delivery formats which might be different from archival formats used in the AIP. Delivery formats are tailored to the bandwidth available and user requirements. A single file might be available in a variety of delivery formats.

RECOMMENDATIONS

Practical Tips for Depositors


Text & PDF
- Unicode character encoding:
  UTF stands for Unicode Transformation Format and denotes a family of character encodings for the Unicode character set. UTF-8 uses code units of one byte (i.e. eight bits) and encodes each character as a sequence of one to four bytes. Other UTF encodings like UTF-16 or UTF-32 use code units of two or four bytes, which can be stored with the most significant byte in first (big-endian) or last place (little-endian). Thus a Byte Order Mark (BOM), the character U+FEFF placed at the very beginning of the file, is used to signal the byte order. Since UTF-8 is byte oriented, a BOM is not necessary and should be avoided. An advantage of UTF-8 is that the 128 ASCII characters are preserved and encoded in the same manner as in ASCII. See the official FAQ about UTF-8, UTF-16, UTF-32 & BOM and the IT-Empfehlungen from IANUS for further information on this topic and encoding in general.
Spreadsheets & Databases
Raster Images
Vector Images
Geospatial Data
3D Graphics & Audio/Video
General
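As a practical addition to the UTF-8 recommendation under Text & PDF above, the following sketch shows how a text file can be checked for valid UTF-8 and how a leading BOM can be removed before deposition. It is an illustration under the assumption that the whole file fits in memory; the function names are made up for this example.

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (the bytes EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# ASCII characters keep their single-byte encoding in UTF-8,
# while other characters take two to four bytes.
ascii_bytes = "A".encode("utf-8")    # one byte: 0x41
umlaut_bytes = "ä".encode("utf-8")   # two bytes: 0xC3 0xA4
```

To clean a file, read it in binary mode, pass the content through strip_utf8_bom and write the result back, ideally to a copy rather than to the original.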


PROGRESS

Technical Basis

When we were planning the implementation of ARCHE in 2017, after thorough evaluation of multiple existing solutions, Fedora Commons seemed the most suitable candidate, being widely used to run repositories around the world, including multiple CLARIN Centres. However, the most widely employed version of Fedora (version 3) had been announced to reach end of life and would not be developed any further. Thus it seemed natural to adopt the new version. Version 4 of Fedora Commons went through a complete redesign and re-implementation, which abandoned many proven concepts and introduced technological decisions that in hindsight turned out to be very problematic. Some of these decisions were reversed in an intermediate version 5 (2020), and work is under way on version 6, for which a stable release is expected at the beginning of 2021.

Meanwhile the problems in our solution grew. Though we were able to work around these, it came at the cost of spending time on developing workarounds. Additionally Fedora’s performance quickly deteriorated with the growing amount of data in the repository, making ingestions of larger datasets next to impossible.

During the three years that we worked with this repository, we gathered a lot of experience and learnt valuable lessons, allowing us to identify the features that are absolutely crucial for us in a repository.

When we came to the difficult decision to abandon our Fedora 4 based solution, we conducted another survey of existing solutions to determine if any of them could better meet our needs. We came to the conclusion that none of the “out of the box” solutions could completely fulfil our requirements. While many open-source solutions offered customisation and extensibility, they also came with opaque, complex components, which could lead to frustrations similar to those we experienced with Fedora 4 and wanted to avoid.

As a result, we decided to develop a custom-tailored solution from scratch, serving our specific needs. We strived to make it as generic as possible, to be applicable in a multitude of scenarios and use cases.

The system is based on a very conservative technology stack: plain, strictly object-oriented PHP with a PostgreSQL database to store the metadata. The overall architecture is cleanly divided into multiple components, each with a clear function and a well-defined API exposing its functionality.

We preserved all functionality, so that both the user interface and the APIs behave exactly the same as before, only an order of magnitude faster and with an order of magnitude lower resource consumption.

Suggestions for more detailed information can be found in the Further Guidance link collection.

How can I save my valuable research material for the future?

What are the FAIR Data Principles?

Do you accept data from anybody?

Do you accept any kind of data?

What is the actual deposition procedure?

How can I compile a list of files for my data?

Why do I have to select a licence?

What licence should I use?

How can the archived data be cited?

What is a PID?

What if I want/need to update the archived data?

How safe is my data in ACDH-repo?

What if I want to withdraw the resources in the future? Can I delete the data?

I don't want/cannot make the data publicly available. Would you still archive them for me?

Do I need to pay to deposit the resources?

How long does the archiving process take?

Why do I have to sign a deposition agreement?

How can the archived data be found?

Can I do anything with the resources? What are the regulations regarding access?

Do I need to pay to get to the resources?

Do I need to register/login to get to the resources?

What is this Federated (or Shibboleth) Login?

What is OAIS?

What does SIP mean?

What does AIP mean?

What does DIP mean?

What is data management and why do I need a plan for it?

How extensive must metadata be for long-term archiving in ARCHE?

How must the data be structured for ARCHE?

Can I archive multiple versions of a file?

Which conventions for file names should be observed?

Which file formats does ARCHE prefer?

What legal aspects do I need to consider? What authorisations do I need?

What does copyright mean and which terms of protection apply in Austria?

What are orphan works and how can I use them?

What do related/neighbouring rights refer to?

What is the difference between a work and an object?

What is independent work?

How must personal and sensitive data be protected?

What needs to be considered with regard to property rights and terms of use?

What special provisions apply according to the Austrian Archive Act?

Which licence is suitable for what? When are additional rights notices helpful?

Why did you switch the software stack from Fedora Commons to a custom-tailored solution?