Data Management Plans Meeting Apr 21 2015

From CASRAI

Attendance

   Anna Clements, St Andrews (Chair)
   Catherine Jones, Science & Technology Facilities Council
   Rachel Proudfoot, University of Leeds
   Hardy Schwamm, Lancaster University
   Veerle Van Den Eynden, University of Essex
   David Baker, Casrai
   Sheri Belisle, Casrai

Agenda

   Welcome and call to order
   Review actions from last meeting
   New template format for dictionary prep (David)
       MoSCoW method for defining importance.
   Ethics use case profile (David)
       Data Collections vs. Data Sets
   Next steps
   AOB

Supporting Materials

   Previous Minutes
   WG Charter
   Ethics Data Profile

Agreed Actions

    Changes to Objects, definition reworking, additional comments columns.
       Action owner: Casrai
       Completed by: Done
   Review definitions and element list – comment or suggest changes to finalise use case profile for dictionary.
       Action owner: WG
       Completed by: May 4

Discussion

Meeting brought to order at 2:11 PM BST.

Actions from last meeting: comments are coming in on the profile; the reformat is complete and will be discussed today.

Agenda Item 3 – New Profile Format

Content isn’t too different, but how it’s organised has changed, both to prepare for import to the dictionary and to simplify and focus discussion on requirements by limiting each discussion to a single use case. This limits the clutter caused by trying to assess too many use cases at once. Adding things to the dictionary incrementally allows, for example, the list of elements to reference more elements that are already in the dictionary, while still allowing definitions etc. to be refined. It’s a more orderly, sequential way of working through the process of adding many elements to the dictionary.

New format is broken down with an overview tab that provides the definition of the use case and a definition of the purpose of the profile. Elements list is broken into Casrai terminology – every element is an attribute that has an obvious and natural object – which aids reuse of the dictionary in and between working groups. The lists tab is for any attribute that is confined by a controlled vocabulary – its list values reside there. Each value will also have a definition and be entered into the dictionary, so there is no ambiguity about list terms. Elements tab references the list with which each attribute is associated. Lists are not mapped one-to-one to attributes in the dictionary. If an attribute in a specific case is controlled by a list, the attribute’s definition is kept separate from the list definition, because there will be cases where the same list values are used by other attributes; this avoids repeating value entries with slightly different definitions. Over time, a list may only be used once, but more often lists will be able to be used by other attributes elsewhere, and greater value is gained by reusing components this way. List name will differ from the attribute name for that reason: it uniquely describes the purpose of the list, not the function of the attribute where it’s being used.
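The separation between attributes and reusable lists described above can be sketched roughly as follows. This is only an illustration, not the dictionary's actual schema: the class names, the "Access Levels" list, its values and the two attributes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ListValue:
    """One entry in a controlled vocabulary, with its own definition."""
    value: str
    definition: str


@dataclass
class ControlledList:
    """A list named for its own purpose, not for any attribute that uses it."""
    name: str
    definition: str
    values: List[ListValue] = field(default_factory=list)

    def allows(self, candidate: str) -> bool:
        return any(v.value == candidate for v in self.values)


@dataclass
class Attribute:
    """An element: an attribute of an object, optionally confined by a list."""
    obj: str
    name: str
    definition: str
    controlled_by: Optional[ControlledList] = None

    def accepts(self, candidate: str) -> bool:
        # Attributes without a list accept any value in this sketch.
        return self.controlled_by is None or self.controlled_by.allows(candidate)


# Hypothetical list and attributes, invented for illustration only.
access_levels = ControlledList(
    name="Access Levels",
    definition="Degrees of openness that can apply to a research resource.",
    values=[
        ListValue("Open", "Available to anyone without restriction."),
        ListValue("Restricted", "Available only under stated conditions."),
    ],
)

# Two different attributes reuse the same list; each keeps its own definition.
dataset_access = Attribute(
    "Research Dataset", "Access", "How the dataset may be accessed.", access_levels
)
output_availability = Attribute(
    "Output", "Availability", "How the output is made available.", access_levels
)
```

Because both attributes point at the same list object, a list value is defined once and reused, which is the reason given above for naming the list independently of any attribute.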

Could there be multiple lists that fit the same attribute depending on what’s needed/how they’re used? The policy is emerging at this point. We haven’t hit that case yet. If such a case comes up, we’ll find a way to support it, as it’s likely to occur again. If different groups have different lists (output types is an example), then what we’re working towards is harmonisation. In a scenario that’s specific to, for example, the humanities, where a group wants to filter the master list, it could be supported by a sub-list. When that need arises, we’ll be able to determine how it can be made workable.
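Since the policy is still emerging, the following is only one possible reading of the sub-list idea: a sub-list is a filtered view that may only contain values already present in the master list. The function name and the output type values are hypothetical.

```python
from typing import Iterable, List

# Hypothetical master list of output types, for illustration only.
MASTER_OUTPUT_TYPES = ["Journal Article", "Book", "Dataset", "Performance"]


def make_sub_list(master: List[str], wanted: Iterable[str]) -> List[str]:
    """Return the master values selected by `wanted`, in master order.

    Raises ValueError if `wanted` contains a value that is not in the
    master list, so a sub-list can never drift away from the master.
    """
    wanted_set = set(wanted)
    unknown = wanted_set - set(master)
    if unknown:
        raise ValueError(f"not in master list: {sorted(unknown)}")
    return [value for value in master if value in wanted_set]


# A discipline-specific filter over the master list.
humanities_output_types = make_sub_list(
    MASTER_OUTPUT_TYPES, {"Book", "Performance"}
)
```

Keeping the sub-list as a strict subset means harmonisation work is done once, on the master list.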

Sample data comes with conventions. Intent is similar to how “lorem ipsum” text is used when trying to do layouts. Casrai can try to find real sample data, then clean it and anonymise it, or we can use the method used by software developers, which is to generate fake data for illustrating concepts. We’ve chosen the latter method, mostly because it’s less resource intensive. So you’ll see fake names and text etc., and can use it as a glimpse of what you’d get in practice. Another convention is for any unique identifier – we’ll be presenting those as GUIDs or UUIDs (globally or universally unique IDs). They aren’t in the format of ORCIDs or ISNIs etc., but they fill the space of a unique ID. Dates will default to ISO format, so they don’t necessarily represent what you’d see in an implementation.
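A minimal sketch of these sample data conventions, assuming Python's standard library: fake names, a UUID standing in for any unique identifier, and an ISO 8601 date. The field names and the name pool are hypothetical and are not taken from the profile.

```python
import random
import uuid
from datetime import date, timedelta

# Invented names, in the spirit of placeholder text.
FAKE_NAMES = ["Ada Example", "Sam Placeholder", "Lee Sample"]


def fake_record(rng: random.Random) -> dict:
    """Build one fake record following the conventions described above."""
    start = date(2015, 1, 1) + timedelta(days=rng.randrange(365))
    return {
        # A random (version 4) UUID fills the space of a unique ID;
        # it is not in the format of an ORCID or ISNI.
        "person_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "person_name": rng.choice(FAKE_NAMES),
        # ISO 8601 date (YYYY-MM-DD); an implementation may show another format.
        "start_date": start.isoformat(),
    }


# Seeding the generator makes the fake data repeatable between runs.
sample = fake_record(random.Random(2015))
```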

Chunking data is not meant to define or indicate how database tables are locally structured, nor do we want new profiles to force local storage changes. We’re trying to find a business-friendly modelling level where ambiguity is reduced about what a certain piece of information is about. We’ve introduced the object/attribute concept not to say that this is object-oriented programming or to force any particular style of storage; it’s just a simple way to let business users unambiguously agree that, for example, the ID is of the PI, not the organisation she works at. This can be done many different ways. The PI could be part of the project – that’s not wrong, but the idea is that the PI’s ID is not integrally part of the project.

Template uses the MoSCoW method. It’s better than just mandatory or optional: must = mandatory, could = optional, and we add should for aspirational elements – ones that can improve the process.
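As a rough illustration of how the three importance levels could be applied when checking a record against a profile, assuming hypothetical element names (the fourth MoSCoW category, "won't", is not mentioned in these minutes and is left out):

```python
from enum import Enum
from typing import Dict, List


class Importance(Enum):
    MUST = "must"      # mandatory
    SHOULD = "should"  # aspirational: can improve the process
    COULD = "could"    # optional


# Hypothetical profile: element name -> importance.
PROFILE: Dict[str, Importance] = {
    "Dataset Title": Importance.MUST,
    "Ethics Approval Reference": Importance.MUST,
    "Retention Period": Importance.SHOULD,
    "Keywords": Importance.COULD,
}


def missing_elements(
    record: Dict[str, str], profile: Dict[str, Importance], level: Importance
) -> List[str]:
    """List elements at the given importance level that the record lacks."""
    return [
        name
        for name, importance in profile.items()
        if importance is level and not record.get(name)
    ]


record = {"Dataset Title": "Fake survey data", "Keywords": "sample"}
blocking = missing_elements(record, PROFILE, Importance.MUST)
advisory = missing_elements(record, PROFILE, Importance.SHOULD)
```

In this sketch only missing must elements would block a record; missing should elements would be reported as advice.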

Agenda Item 4 – Ethics Use Case and Research Data Collection Versus Dataset

Some orgs speak in terms of sets and others of collections. This was Casrai’s effort to chunk the data (sets and collections), but it needs work. Some orgs use sets and collections synonymously. Research Data Alliance work on their efforts to define datasets was circulated to the group. Trying to distinguish between the two might be complicating things too much. DMPs at the moment are very vague and switch arbitrarily between sets and collections to be archived. Should we be more structured and defined as a group? From Casrai’s perspective, if data collection and dataset are in practice used synonymously, this group should pick one or the other to use, then in the definition, or in the extended definition, make clear that both terms are used and how they’re related.

Casrai’s thought (which could be wrong) was that a data collection contains data sets and a data set is contained in a data collection – that was the relationship we used in our “chunking”. So, for example, a certain kind of data from one instrument over one year would be a set; that data, combined with data from instruments worldwide over a longer time frame, would be the data collection. Risk with the term collection is that it can have more than one meaning. Collection can be about the process rather than the data itself, so it’s ambiguous. Is there any common use of the noun collection that can be added to the definition? If grouped data sets can be defined by another term, that could also be used.

Different understandings across disciplines – database, file, set, collection – can’t be clearly distinguished. We need a single term with an unambiguous definition. Safest, at this point, is for the group to agree to use dataset as an object for which a number of attributes are important. Once the profile is entered into the dictionary, and it’s used and implemented, and other use cases arise, the question can be addressed more deeply. Action: remove Research Data Collection, use Research Dataset for all attributes; define it accordingly.

Some of the attributes seem to be about procedure. Should they be about the dataset? It could result in duplication, but proper definitions will address that. Some might suggest other objects, but we’re just trying to identify what’s needed for this use case; it’s a level of normalisation. If enough people coming in for public review feel the same, a future iteration might address that.

Previous ethics format was transferred to this one, so many definitions look more like comments. They need some reformatting to make them into proper definitions. Action: Casrai will fix the definitions to address grammar, followed by the working group reviewing them.

Plan is to create a sheet for each use case. Items and definitions will then be repeated – existing elements will have links to the dictionary, so work won’t be repeated. Group should edit one use case at a time. A comment field will also be added (between columns B and C) so existing definitions can be challenged. Action: Sheri will add the column.

There may be concerns about an attribute, such as transport/security of data. In reviewing the profile, watch for two things. Is an attribute that’s important actually missing? If so, insert a row with a label/attribute and a definition or notes, so the group and the Casrai team can create a proper object/attribute/definition for it. Is an attribute not quite right? If so, use the comments column to express issues. Sheet is informal.

Agenda Item 5 – Next Steps

Pilot has been extended, but work is not likely to stop. We should work as quickly as possible to get all use cases into the dictionary. This may mean the group is still looking at use cases in July; we expect to finish.

Agenda Item 6 – AOB

Jisc/Casrai survey sent round by Jisc, to be completed by April 24.

Next meeting is May 21, 2:00 PM BST.