Table of Contents

  1. Preface
  2. Intended Audience
  3. Creating a CLARIN metadata component
  4. Using a CLARIN metadata component
  5. Contact

Preface

The idea behind the CLARIN metadata components is to modularise metadata information so that researchers can easily build a metadata schema that suits their needs exactly. A number of existing components can be found on the CLARIN web site, but if you don't find the one you need there, you can easily create your own.

Intended Audience

At the moment, there are no special tools for creating or using the CLARIN metadata components. To use the system you therefore have to be comfortable editing XML files by hand (or with an XML editor such as <oXygen/>). The following instructions assume that you have a working knowledge of XML documents and XML Schemas.

Creating a CLARIN metadata component

A number of CLARIN metadata components already exist. Even if none of them is exactly what you want, the easiest approach is usually to take an existing component that comes close, modify it according to your needs and save it under a new name as a new component. This tutorial nevertheless assumes that you start a new component from scratch. If you need more examples to clarify a certain step, it may help to look at how existing components handle it.

Every component starts with the XML declaration, which is followed by the root element CMD_ComponentSpec.

<?xml version="1.0" encoding="UTF-8"?>

<CMD_ComponentSpec xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="general-component-schema.xsd">

If you do not keep the general-component-schema.xsd in the same folder as your new component, you have to modify the path for the schema location accordingly.
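
As a minimal sketch of such a modified path, the following skeleton assumes that the schema is kept in a sibling folder called schemas. The folder name is purely illustrative, and the component body is left out for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- assumption: general-component-schema.xsd lives in ../schemas/ -->
<CMD_ComponentSpec xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:noNamespaceSchemaLocation="../schemas/general-component-schema.xsd">
    <Header />
    <CMD_Component name="Actor">
        <!-- child elements go here -->
    </CMD_Component>
</CMD_ComponentSpec>
```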

The new component then starts with a <Header> element. This element has no function yet, so you can leave it empty for the time being, like this: <Header />.

After that comes the actual component itself. The tag is <CMD_Component> and it could look like this:

<CMD_Component name="Actor" CardinalityMin="0" CardinalityMax="unbounded">

The component has a name (Actor in this example) and two optional attributes that determine how often it may appear in the components or profiles (see below) that use it. The example component Actor can appear as often as you like (because the maximum cardinality is unbounded), but it does not have to appear at all (because the minimum number of occurrences is zero). You can enter any non-negative integer for these attributes: the lowest possible value is 0 and the highest is unbounded. The value of CardinalityMax must be at least as high as the value of CardinalityMin.
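
To illustrate a bounded range, here is a sketch of the same component restricted to between one and three occurrences. The values 1 and 3 are chosen only for illustration.

```xml
<!-- at least one Actor is required, and no more than three are allowed -->
<CMD_Component name="Actor" CardinalityMin="1" CardinalityMax="3">
    <!-- child elements go here -->
</CMD_Component>
```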

Additionally there are the attributes id and ConceptLink.

ConceptLink is used to link a CMD_Component (or a CMD_Element) to a definition in a data category repository such as ISOcat. Because it is not always possible to find a suitable concept to link to, this attribute is optional. It should, however, be used whenever possible. To use it, simply point to the appropriate concept in the data category repository, for example like this:

<CMD_Component name="TextTMD" ConceptLink="http://www.isocat.org/datcat/CMD-000">

id is used to link a component to a specific resource. See below for how exactly this works.

A component has four different kinds of child elements:

Please note that you have to use these child elements in exactly this order. Each of them is described in detail below.

AttributeList

Components - as well as elements - will be mapped to XML elements in the resulting schema, so it is possible to define attributes for both CMD_Components and CMD_Elements. This is done by using the special sub element AttributeList. Here is an example from the descriptions IMDI component for an element with an attribute:

<CMD_Element name="description" ValueScheme="string" CardinalityMin="0" CardinalityMax="unbounded">
    <AttributeList>
        <Attribute>
            <Name>LanguageID</Name>
            <Type>string</Type>
        </Attribute>
    </AttributeList>
</CMD_Element>

As you can see, all attributes for a CMD_Component (or a CMD_Element, as both work the same way in this regard) are part of the wrapper element AttributeList. The AttributeList can include as many attributes as you like. Each attribute has to have a name, which can be any string of alphanumeric characters, and a type. The type has to be one of boolean, decimal, float, string or anyURI. Please note that case matters here, so a type such as String or FLOAT is not acceptable. (Please see the XML Schema specification for more information about these types.)
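
As a sketch of an AttributeList with more than one attribute, the example above could be extended as follows. The attributes Translated and SourceLink are hypothetical and serve only to show other built-in types.

```xml
<CMD_Element name="description" ValueScheme="string" CardinalityMin="0" CardinalityMax="unbounded">
    <AttributeList>
        <Attribute>
            <Name>LanguageID</Name>
            <Type>string</Type>
        </Attribute>
        <!-- hypothetical attributes, added to illustrate other built-in types -->
        <Attribute>
            <Name>Translated</Name>
            <Type>boolean</Type>
        </Attribute>
        <Attribute>
            <Name>SourceLink</Name>
            <Type>anyURI</Type>
        </Attribute>
    </AttributeList>
</CMD_Element>
```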

As an alternative to the five built-in types mentioned in the previous paragraph it is also possible to define your own types. To do this you use the child element <ValueScheme> instead of <Type>:

<AttributeList>
    <Attribute>
        <Name>sex</Name>
        <ValueScheme>
            <enumeration>
                <item>male</item>
                <item>female</item>
            </enumeration>
        </ValueScheme>
    </Attribute>
</AttributeList>

How exactly the element <ValueScheme> works is described in detail below.

CMD_Element

These are simple elements. They have a name and a ValueScheme. The latter determines the data type of the element's content. There are two ways to specify a CMD_Element's ValueScheme. If it is a simple data type, you can give it as the value of the attribute ValueScheme, like this:

<CMD_Element name="EthnicGroup" ValueScheme="string"/>

Five simple data types are available for this purpose: boolean, decimal, float, string and anyURI. (Please see the XML Schema specification for more information about these types.)
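
For illustration, a component using some of the other simple types might look like the sketch below. The element names are hypothetical; only the ValueScheme values are taken from the list above.

```xml
<!-- hypothetical element names; only the ValueScheme values come from the list above -->
<CMD_Component name="Actor">
    <CMD_Element name="Anonymized" ValueScheme="boolean"/>
    <CMD_Element name="Age" ValueScheme="decimal"/>
    <CMD_Element name="Homepage" ValueScheme="anyURI"/>
</CMD_Component>
```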

Alternatively, you can define a more complex (i.e. more restrictive) ValueScheme by using the sub element <ValueScheme> instead of the attribute ValueScheme, as in the following two examples:

  1. <CMD_Element name="BirthDate">
        <ValueScheme>
            <pattern>(1|2)\d{3}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])</pattern>
        </ValueScheme>
    </CMD_Element>
    
  2. <CMD_Element name="Sex">
        <ValueScheme>
            <enumeration>
                <item>Male</item>
                <item>Female</item>
                <item>Unspecified</item>
                <item>Unknown</item>
            </enumeration>
        </ValueScheme>
    </CMD_Element>
    

As you can see from the examples, there are two ways to restrict the possible values for a CMD_Element. The first option is to define a pattern (example 1) that the values for your element must conform to. The pattern is defined using XML Schema regular expressions. You can use it for cases where you know that your element values will have (and should have) a certain shape, like the dates in the example.

The other option is to define a list of possible values for your element. If you know that only a (reasonably low) number of values can occur for this element, you can list them all in the ValueScheme like in example 2 above.
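
As a further sketch of the pattern option, the hypothetical element below accepts only two-letter lower-case codes. It assumes that the pattern ends up as an XML Schema pattern facet, which is matched against the whole value, so no start or end anchors are needed.

```xml
<CMD_Element name="LanguageCode">
    <ValueScheme>
        <!-- exactly two lower-case letters, e.g. "nl"; "NL" or "nld" would be rejected -->
        <pattern>[a-z]{2}</pattern>
    </ValueScheme>
</CMD_Element>
```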

A CMD_Element can also have a ConceptLink attribute, which is used to link it to a definition in a data category repository such as ISOcat. This works the same way as it does for components, so see the description above for details.

Additionally, you can also add an attribute list to elements. It works the same way here as it works for components, so see above for more specific instructions on how to do it.
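
Putting both together, an element with a ConceptLink and an attribute list could be sketched as follows. The URL is the placeholder used in the component example above, not a real data category for this element.

```xml
<!-- the ConceptLink value is a placeholder, not an actual data category -->
<CMD_Element name="EthnicGroup" ValueScheme="string"
             ConceptLink="http://www.isocat.org/datcat/CMD-000">
    <AttributeList>
        <Attribute>
            <Name>LanguageID</Name>
            <Type>string</Type>
        </Attribute>
    </AttributeList>
</CMD_Element>
```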

CMD_Component

Because CLARIN metadata is modular, a component can of course include other components. In principle, a component that is used by another component can be defined inline. This means that you put the entire definition of the second component into the definition of the first one, just as if you were defining an element. Such an inline definition could look like this:

<CMD_Component name="Actor" CardinalityMin="0" CardinalityMax="unbounded">
    <!-- inline component definition -->
    <CMD_Component name="ActorLanguage" CardinalityMin="0" CardinalityMax="unbounded">
        <!-- inline element definition -->
        <CMD_Element name="ActorLanguageName" ValueScheme="string"/>
    </CMD_Component>
</CMD_Component>

While this may seem quite reasonable for tiny components such as the one in the example, which has only one element, we strongly suggest that you define each component on its own and put it in its own file. This makes the structure clearer, i.e. it is easier to see at a glance how many components are used and what they are called. It also adheres more strictly to the modular concept of the CLARIN metadata infrastructure: if you define every component individually, you can reuse it more easily and, perhaps even more importantly, so can other CLARIN users.
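
As a sketch of this approach, the inline ActorLanguage component from the example above could be moved into a file of its own. The file name component-actor-language.xml and the empty Header are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- assumed contents of component-actor-language.xml -->
<CMD_ComponentSpec xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:noNamespaceSchemaLocation="general-component-schema.xsd">
    <Header />
    <CMD_Component name="ActorLanguage" CardinalityMin="0" CardinalityMax="unbounded">
        <CMD_Element name="ActorLanguageName" ValueScheme="string"/>
    </CMD_Component>
</CMD_ComponentSpec>
```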

To include externally defined components in your new component, simply use the CMD_Component attribute filename to point to the CMD_Component that you want to include.

<CMD_Component name="Actor" CardinalityMin="0" CardinalityMax="unbounded">
    <CMD_Component filename="component-actor-language.xml" />
</CMD_Component>

You have to enter the (absolute or relative) path to the included component if it isn't stored in the same directory as the referring component. As with components that are defined inline, you can also set one or both of the cardinality attributes when you include an external component. If the including statement and the external definition have conflicting cardinality values, the cardinality defined in the including statement always takes precedence.
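
The following sketch combines both points: a relative path to a component stored elsewhere and a cardinality set in the including statement. The folder name shared and the cardinality values are hypothetical.

```xml
<CMD_Component name="Actor" CardinalityMin="0" CardinalityMax="unbounded">
    <!-- hypothetical relative path; at most two languages per actor,
         whatever cardinality the external file declares -->
    <CMD_Component filename="../shared/component-actor-language.xml" CardinalityMin="0" CardinalityMax="2" />
</CMD_Component>
```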

Using a CLARIN metadata component

Once all the components you want to use exist (either already available or defined by you as described above), you can put them together into a CLARIN metadata profile. A profile consists of a number of CMD_Components. It could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<CMD_ComponentSpec xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="general-component-schema.xsd">

    <Header>Example of an MD profile that consists of included components</Header>

    <CMD_Component filename="example-component-text.xml" />
    <CMD_Component filename="example-component-photo.xml" />

    <!-- use a list of 0 to 4 actors, overriding the default cardinality for an Actor component -->
    <CMD_Component filename="example-component-actor.xml" CardinalityMin="0" CardinalityMax="4" />

</CMD_ComponentSpec>

The above example is a complete CLARIN metadata profile. Yours will probably be a bit longer and more complex, but don't worry. As long as it validates against the general-component-schema.xsd, everything will work out just fine. Now you simply have to apply the stylesheet comp2schema.xsl to your new profile. The stylesheet will then create a new XML Schema from the profile. Please note that the stylesheet uses many XSLT 2.0 features, so you will need a processor that supports XSLT 2.0, for example Saxon-B 9.x.

Please note that the XML Schema resulting from the stylesheet transformation has a Header element. This is generated automatically and will always be added to any schema you create. This element is intended to hold additional information about the actual metadata instances that are created using this schema. The individual fields are the following:

Additionally, the stylesheet will also automatically create a Resources element. The sub element ResourceProxyList is basically a list of resources. For each resource you have to enter its type, a link to it and an id. This id can be referenced from components in your schema, so if you have a component Actor-Photo you can use this component's id attribute to link to the corresponding resource entry for the actual photo.

Contact

If you have questions or need more information, you can go to the CLARIN metadata toolkit website and add a comment there or you can e-mail the CLARIN MD Team directly: