Open Access Reporting Meeting Apr 15 2015

From CASRAI

Attendance

   Nick Anderson, Symplectic Inc
   Ben Johnson, HEFCE
   Stuart Lawson, Jisc
   George Macgregor, University of Strathclyde
   Cecy Marden, Wellcome Trust
   Valerie McCutcheon, University of Glasgow
   Balviar Notay, Jisc
   Ben Ryan, EPSRC
   David Baker, Casrai
   Sheri Belisle, Casrai

Agenda

   Welcome and call to order
   Review action items from last meeting
   Main Discussion:
       Review profile for finalisation (David)
       Specification for OA reporting V4 (Balviar)
   Date of next meeting
   AOB

Supporting Materials

   Previous minutes
   Draft data profile
   Charter/work plan
   E2E Workshop Feedback
   CASRAI Specification for OA Reporting_V4

Agreed Actions

   Changes to Data Profile based on discussion
       Action owner: Sheri
       Completed by: Done
   Try to finalise use case
       Action owner: WG
       Completed by: April 24
   Review and comment on V5 of the Specification
       Action owner: WG
       Completed by: April 24

Discussion

Welcome, meeting brought to order at 2:04 PM BST.

Review actions from last meeting. New profile will be discussed, addresses comments. Meeting dates sorted. Specification is on the agenda.

Agenda Item 3a – New Profile

In order to move into the next phase of preparing profiles and their component elements for adding to the dictionary, before wider review, we’re using a new approach. Focus will be on one use case at a time; this does not mean other use cases will be ignored. We are merely trying for a sequential and manageable approach to putting the last several months’ work into the draft area of the dictionary, to be consumed and read for specific comment and review. When the first use case goes into the dictionary, it is not the last chance the groups will have to modify labels or definitions if issues are discovered in later use cases. It’s just to be able to focus on specific requirements one at a time. It is getting difficult to manage all requirements – this approach will address all of them but allow us to better focus on individual elements and their definitions.

What goes into the dictionary are “elements”. Elements are broken into Object/Attribute. First name is an attribute of PI; this is not object-oriented programming, merely business-level abstractions for re-usability. There are many cases where attributes will be used by other profiles either in the UK or elsewhere. Dividing terms this way allows a natural, structural, record-based organisation within the dictionary. It’s a list of terms, but those terms have assumptions made about them based on their nature – hence the object/attribute division. Working in that space we’re not dictating local storage of objects and attributes, we’re simply using object/attribute lists to organise logically the info we’re talking about. Local implementers will need to ask how that collection can be fulfilled in their local systems. We aren’t telling them how to structure local tables or anything like that. We’re making clear business information requirements that are technology agnostic. The tech community might question labels, but that’s not what they’re for. We are just simplifying business agreement, and then techs can interpret as needed for models or databases.
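As a rough illustration of the object/attribute division described above, the sketch below models dictionary elements as plain records and groups attributes under their objects. It is not a Casrai schema: the class name, field names, definitions and the "Last Name" attribute are assumptions made for the example, and nothing here implies how implementers should store the data locally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One dictionary entry: an attribute belonging to a business-level object.

    All names here are illustrative assumptions, not Casrai definitions.
    """
    object_name: str
    attribute_name: str
    definition: str
    status: str = "draft"  # entries stay draft until the working group clears them

    @property
    def label(self) -> str:
        return f"{self.object_name} / {self.attribute_name}"


def group_by_object(elements):
    """Organise a flat list of elements into {object: [attributes]}."""
    grouped = {}
    for element in elements:
        grouped.setdefault(element.object_name, []).append(element.attribute_name)
    return grouped


# Hypothetical entries; the definitions are placeholders written for this sketch.
ELEMENTS = [
    Element("Principal Investigator", "First Name", "Given name of the PI."),
    Element("Principal Investigator", "Last Name", "Family name of the PI."),
    Element("Funding Award", "Application ID", "Identifier of the application."),
]

if __name__ == "__main__":
    for object_name, attributes in group_by_object(ELEMENTS).items():
        print(object_name, "->", ", ".join(attributes))
```

The same attribute list could back a spreadsheet tab, an XML schema or database tables; the grouping only organises the business terms.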

This group/use case has produced 30+ attributes. There have been more than that discussed but the point of this step is to focus on this APC use case. We need to determine if we have the right elements, the right labels and definitions, and whether we have the right balance in object structure. When doing an abstraction, there are always several ways it can be done. What we’ve got is the Casrai team’s interpretation of the requirements, broken into natural objects and attributes. Whether an application ID is an attribute of the funding award or of the project depends on clarity and agreement of the business users/subject experts, and also on whether it’s clear and unambiguous to tech for modelling. The Casrai Help tab explains what’s in each tab along with assumptions and limitations.

There will be a separate spreadsheet for each use case; elements already in the dictionary will be linked. Once this one is complete and in the dictionary, we’ll have 30+ new elements in the pool that can be used for the next use case, so we should see more links. That doesn’t mean those objects can’t be better defined, as that’s part of the process, but it does simplify/lessen the task. The risk in trying to include many use cases in one view is greater confusion. These new dictionary entries will be marked draft as they’ll still need a series of reviews triggered by the working group. Once the group decides they represent accurate output, the draft marking is cleared and they become permanent entries.

Sample data will have certain types of fields – dates, IDs, long text fields, short text fields – that will continually crop up in our work. The date, for example, will always be extended by its object to define what it is a date for, but fundamentally it’s a date. Having a view of sample data will help frame what a spreadsheet or an XML file that complies with a particular profile instance might look like, so we can know if it will have everything needed as data is consumed. There are many ways to generate sample data; some are time consuming and expensive. We’re not presenting data from a volunteer database that’s been anonymised. This is literally fake data; many s/w development activities need fake data to test functions etc. It’s the equivalent of lorem ipsum in publishing/user interface demos. Some compromises are made: if we had the resources to access real sample data, cleared and anonymised, from RCUK or HEFCE or institutions, it would be easier to glance at and understand, but it’s a non-trivial activity to do that and fit it into monthly cycles and quick turnaround, so we use a fake data generator. Whenever a date appears, we use the ISO standard, which goes down to minutes. That doesn’t say we need to capture to the second; values can be parsed down as needed. It’s just a generic reference to ISO formats and can be looked at as shorthand. For unique IDs, we’re using GUIDs, or UUIDs; these are globally or universally unique identifiers. It’s a generic, widely-used s/w tool that can generate a long unique number. We know DOIs or PUBMED IDs don’t look like this, but again, it’s a shorthand that will represent a unique ID.

In this profile, there are 3 types of unique IDs: PMC, DOI, and ISSN. They’ll each look different from one another under local technical constraints, but here, when it’s sample data fulfilling a unique ID element, our generator will give you a UUID so there isn’t a blank.
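The sample-data idea described above can be sketched with Python's standard library. This is only an assumed, minimal stand-in for the fake data generator mentioned in the discussion, not the tool Casrai actually uses. The field names "Author Name", "Date of Application" and "Notes" are hypothetical; the three ID fields follow the profile's PMC, DOI and ISSN, each filled with a UUID as shorthand for "some unique ID".

```python
import random
import uuid
from datetime import datetime, timedelta

# Hypothetical mapping of profile fields to the recurring field types.
FIELD_TYPES = {
    "Author Name": "short_text",
    "PMC ID": "unique_id",
    "DOI": "unique_id",
    "ISSN": "unique_id",
    "Date of Application": "date",
    "Notes": "long_text",
}


def fake_value(field_type, rng):
    """Return a placeholder value for one of the recurring field types."""
    if field_type == "unique_id":
        # A UUID stands in for any unique ID; real DOIs or ISSNs look different.
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))
    if field_type == "date":
        # ISO-style timestamp to the minute, somewhere in an arbitrary year.
        start = datetime(2015, 1, 1)
        moment = start + timedelta(minutes=rng.randrange(365 * 24 * 60))
        return moment.isoformat(timespec="minutes")
    if field_type == "short_text":
        return "Lorem ipsum"
    if field_type == "long_text":
        return "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
    raise ValueError(f"unknown field type: {field_type}")


def fake_record(field_types, seed=None):
    """Build one fake record with no blank fields."""
    rng = random.Random(seed)
    return {name: fake_value(kind, rng) for name, kind in field_types.items()}


if __name__ == "__main__":
    for name, value in fake_record(FIELD_TYPES, seed=1).items():
        print(f"{name}: {value}")
```

Passing a seed makes the fake records repeatable, which can help when the same sample needs to be compared across review cycles.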

The elements list is an abstraction that can be useful, but is hard to imagine in the ‘real world’. Sample data is a way of taking the list of elements and presenting it as records. It’s not meant as a tech guide or a specification for implementers; it’s meant as an aid to understanding for subject experts. Right now in a Casrai pilot, GUIDs are being used to test and prototype the data exchange concepts, not to test how they parse certain IDs. Format details for tech working with specific IDs would come from the organisations that created them.

Working group members already find the sample data helpful, for example to compare to existing data; it has also helped identify gaps in the profile.

Add ID to applicant name (ORCID) as well as publisher (CrossRef?). There is a pattern emerging around unique IDs in Casrai – whenever something needs to be uniquely identified, there will be at least 2 attributes that need to be captured. One is ID Type – what types does the community see as valuable? For persons, so far we’re seeing ISNI and ORCID, and for organisations, maybe ISNI, FundRef. Unless you want to force people to use one or another, there are still certain general types in wide use around the world. There might be 5, 6, 12… if we can support them, then we know what to expect in the next field, which is the ID itself. If we apply that common pattern every time we want to give multiple options for uniquely identifying something, it allows us to do so without forcing people to change their current approach, until such time as the community evolves and entries on the list can be deprecated. Action: Add ID Type and ID Fields to name and publisher.
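The ID Type plus ID pattern discussed above might be sketched as a pair of fields checked against a list of supported types. This is an assumed illustration only: the type lists reflect the examples raised in the meeting (ISNI and ORCID for persons, ISNI and FundRef for organisations), and the function and class names are hypothetical.

```python
from dataclasses import dataclass

# Types mentioned in discussion; the community would maintain the real list.
SUPPORTED_ID_TYPES = {
    "person": {"ORCID", "ISNI"},
    "organisation": {"ISNI", "FundRef"},
}


@dataclass(frozen=True)
class Identifier:
    """The two attributes captured for anything uniquely identified."""
    id_type: str  # tells the consumer what to expect in the next field
    value: str    # the ID itself, in whatever format its issuer defines


def make_identifier(entity_kind, id_type, value):
    """Accept an ID only if its type is on the supported list for this entity."""
    allowed = SUPPORTED_ID_TYPES.get(entity_kind, set())
    if id_type not in allowed:
        raise ValueError(f"{id_type!r} is not a supported ID type for {entity_kind}")
    if not value.strip():
        raise ValueError("ID value must not be blank")
    return Identifier(id_type=id_type, value=value.strip())


if __name__ == "__main__":
    # Placeholder value, not a real ORCID iD.
    print(make_identifier("person", "ORCID", "example-orcid-value"))
```

Deprecating a type later would then amount to removing it from the supported list, without changing the two-field structure.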

Re definitions, which refer to author or staff in terms of who’s made the application. The use case is more about the application than who’s doing it. Do we need to determine who may or may not apply? Is there a need for a field that records whether the applicant was eligible? In 80% of cases it’s the author, but often it isn’t; sometimes an admin applies on behalf of the author, so maybe we shouldn’t narrow scope. Is it the person applying that needs to be recorded or the person on whose behalf the application is made? The latter would be the one reported against. Record the affiliated author, keep whoever contacted them in the notes. Suggest changing defs to say made by eligible representative – helps, but need author name somewhere too. Change def to author? Can also use an extended definition for more clarity. What about keeping applicant name, adding author name, and making department “author” as well? Action: Change to Author Name and Author Department, capture contact in notes field, refine definitions to ‘author’ and for Date, change ‘author staff member’ to ‘eligible rep’.

Suggestion to replace ‘…go for gold’ with ‘…to pay APC’ in the definition for Estimated Charge. Agreed. Action: update estimated charge definition.

We’re journal article focussed with this use case. There are some cases where APCs are paid for conference papers; should we address that or return to it later? Suggestion is that article processing charges be the focus, as it’s very specific, but we know there are other kinds of open access-related charges that are also an extended use case. Based on this group’s setting priority for it, we could tackle that in a subsequent step. How can we take the basic structures that are outlined in the specifics of article processing charges and expand them to apply to other types of outputs, including conference papers? In that process we may introduce a variant profile that has elements like output type where you can select. We’d tweak things as needed.

Publisher’s unique IDs, is there an authoritative list? Also issue with definition. Distinction needs to be drawn between repository and publisher. Definition needs adjusting. Is there an existing mechanism for identifying publishers or are there 2 or 3 that may be competing? Then we could use that ID Type and ID pattern. Not to replace the name, but add to it. Action: Add Publisher ID Type and ID as aspirational elements, fix definition (remove reference to repository).

Aside – introducing importance to capture – using MoSCoW approach – Must, Should, Could, Won’t. Must is mandatory, Could is optional, and Should would be aspirational. Link to another approach: http://www.ietf.org/rfc/rfc2119.txt. Action: add this column to the elements list.
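One way the importance column could be used is sketched below, as an assumption rather than anything agreed in the meeting. The Must/Should/Could meanings follow the discussion; the meaning given to Won't, the importance assigned to "Author Name" and "Notes", and the helper name are all hypothetical. The Publisher ID entries are marked Should because they were described as aspirational.

```python
from enum import Enum


class Importance(Enum):
    MUST = "mandatory"
    SHOULD = "aspirational"
    COULD = "optional"
    WONT = "not captured"  # assumption: the minutes do not define Won't


# Hypothetical importance column for a few elements from the profile.
IMPORTANCE_BY_ELEMENT = {
    "Author Name": Importance.MUST,          # assumption for illustration
    "Publisher ID Type": Importance.SHOULD,  # recorded as aspirational
    "Publisher ID": Importance.SHOULD,       # recorded as aspirational
    "Notes": Importance.COULD,               # assumption for illustration
}


def missing_mandatory(record, importance_by_element):
    """List the Must elements that are absent or blank in a record."""
    missing = []
    for name, importance in importance_by_element.items():
        if importance is not Importance.MUST:
            continue
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_mandatory({"Notes": "Contacted by admin"}, IMPORTANCE_BY_ELEMENT))
```

Only Must elements block a record here; Should and Could elements can be left blank, which matches the intent of not forcing people to change their current approach.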

For IDs, maybe stress in the definition that ORCID is preferable?

Out of time for discussion, but comments should continue in the data profile. Add your thoughts, approach can be informal, the team will make changes and seek clarification where needed. Action: try to resolve this use case by end of next week.

Agenda Item 3b – Specification

Update sent to group. Fuller proposal is being put together. Delays due to holidays but hopefully next week can be funded. Peter West lined up to do work. Currently doing EPrints work. Will be supported by Sheridan. Sheridan/Peter would like to talk to Casrai to make sure work is being done together and eliminate duplication. End date for Casrai is June, specification is July due to extra outputs and delays.

Group still needs to review the latest update (V5). Use case approach useful, hope consultant will take account of that and have as an early step determining what use cases need to be worked through. Casrai can ensure progress isn’t short-circuited by this process and that work is done in tandem.

Next meeting is May 20. No other business.

Adjourned at 3:08 PM BST.