eFamily

Documentation for the eFamily Schema

Version: 1.0

Schema Version: 2004-08-14

Last Modification: 10-Mar-2005 12:51:31

Contributors: Rob Finn, Dave Howorth, Andreas Prlic, Sameer Velankar


Background

The eFamily project is designed to integrate the information contained in five of the major protein databases (CATH, InterPro, MSD, Pfam and SCOP). CATH, SCOP, InterPro and Pfam contain information describing protein domains. The domain definitions of the former two databases are based on protein structure, while those of the latter two are based on protein sequence. The MSD database is the primary data warehouse for data exchange and integration, containing the fundamental mapping from protein sequence (UniProt) to structure (PDB).

Although the different domain databases offer related views of proteins, it is often difficult for biologists to navigate from protein sequence to protein structure and back again. The aim of this project is to provide the scientific community with a coherent and rich view of protein families that allows users to navigate seamlessly between the worlds of protein structure and protein sequence. In this project we are developing data exchange mechanisms and services that exploit grid utilities.

The aim of the eFamily schema is to allow the different domain definitions and mappings (between sequence and structure) to be exchanged using the same basic file format. The schema is designed to wrap up single database entries that can be downloaded from an FTP site or, better still, exported using Web services (see the eFamilyService schema documentation for more details). As well as the domain boundary definitions, the schema also allows any associated sequence or structural alignments to be encapsulated in XML.


The Schema

Below we describe the eFamily schema and the theory behind its design. The schema is complex, but we assume that readers of this document already have some understanding of XML and schemas. If you are unsure of anything, look at w3schools or send us an email.

The eFamily schema imports the following two schemas:

The eFamily schema includes the alignment schema.

Schema Location: http://www.efamily.org.uk/xml/efamily/2004/08/14/eFamily.xsd

How does the documentation work? We walk through the schema from the root element to the leaves of the schema tree. When an element is of simple type, or of complex type but unbranched, it is described where it occurs in its parent. When an element is of complex type and branched, it is described briefly in the parent and then in detail, as a parent in its own right, under a separate heading. After the walk-through, the summary section gives a full view of the schema. Finally, there are links to examples from each eFamily member database.


The Root: Entry element

Structure of the Entry element

Format:

<entry>
	<rdf:RDF>see below</rdf:RDF>
	<entryDetail
		dbSource="The information source"
		property="The property of the CDATA">
			entryDetail information
	</entryDetail>
	<entity>see below</entity>
	<alignment>see below</alignment>
</entry>

Element:<entry>

(required; once only) The root element is entry. This represents any database entry. As such, the attributes for this element define the database source.

attributeGroup:dbRef

This attribute group is used for referring to a database.

attribute:dbSource

(required). The name of the database.

attribute:dbVersion

(required). The version of the database or of the entry (e.g. PDB REVDAT). The date should be in the format YYYY-MM-DD.

attribute:dbCoordSys

(optional). The co-ordinate system used by the database. This is not always the same as the database. For example, Pfam uses UniProt as an underlying co-ordinate system.

attribute:dbAccessionId

(required). The database entry accession id. For example, SCOP uses a sunid, while Pfam uses accessions (e.g. PF01020).

attribute:date

(required) Must be in the standard xml date format, and is the date that the document was produced.

Element:<entryDetail>

(optional, one or more). Allows additional information about the entry to be included.

attributeGroup:detail

The following attributes are bundled together to form a group of attributes that are capable of describing additional information about a particular node.

attribute:dbSource

(required). The database from which the information originates.

attribute:property

(required). A description of the CDATA. For example, to give an alternative id from another database, such as a UniProt ID, you would use <someDetail dbSource="UniProt" property="id">VAV_HUMAN</someDetail>.

Element:<rdf:RDF>

(optional, once only when used).

The rdf:RDF element

The Resource Description Framework (RDF) is an imported schema that allows the encapsulation of metadata.

It is beyond the scope of this documentation to describe the imported schema; more information can be found in the W3C RDF documentation.
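As a rough illustration of the entry attributes described above, the following Python sketch builds a minimal <entry> element with the standard library. It is a sketch under stated assumptions, not a schema-valid document: the XML namespace and the required <entity> children are omitted, the helper name build_entry is ours, and the dates and the detail text are made-up placeholders.

```python
import xml.etree.ElementTree as ET


def build_entry(db_source, db_version, accession, date, details=()):
    """Build a bare <entry> element carrying the dbRef attributes and date.

    Hypothetical helper: namespaces and the required <entity> children
    are deliberately left out to keep the sketch short.
    """
    entry = ET.Element("entry", {
        "dbSource": db_source,
        "dbVersion": db_version,
        "dbAccessionId": accession,
        "date": date,
    })
    # Each detail is a (dbSource, property, text) triple.
    for source, prop, text in details:
        detail = ET.SubElement(
            entry, "entryDetail", {"dbSource": source, "property": prop}
        )
        detail.text = text
    return entry


# Placeholder values for illustration only.
entry = build_entry(
    "Pfam", "2005-03-10", "PF01020", "2005-03-10",
    details=[("Pfam", "id", "example_id")],
)
print(ET.tostring(entry, encoding="unicode"))
```

The optional dbCoordSys attribute could be added to the attribute dictionary in the same way.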

The entity element

The entity element encapsulates one instance of a domain or, in the case of a mapping, a PDB chain or UniProt sequence.

Format:

<entity 
	type="definition type" 
	entityId="A unique identifier for the entity in the document">
		<entityDetail 
			dbSource="the information source" 
			property="the property of the CDATA">
				entity information that falls outside the schema.
		</entityDetail>
	<segment>see below</segment>
</entity>

Element:<entity>

(required, one or more). The entity element represents two very different classes of data. The first is one or more domain definitions for a database entry. Alternatively, an entity can represent one or more chains in a PDB file.

attribute:type

(required). The type of the entity. For entries from the domain databases the type is domain. Note that a domain may be heterogeneous in its composition (e.g. RNA and protein). The other types are protein, RNA and DNA. These are used for defining the chain types in a PDB file.

attribute:entityId

(required). This should be a unique identifier for the entity in the entry.

Element:<entityDetail>

(optional, one or more).

attributeGroup:detail

The following attributes are bundled together to form a group of attributes that are capable of describing additional information about a particular node.

attribute:dbSource

(required). The database from which the information originates.

attribute:property

(required). A description of the CDATA. For example, to give an alternative id from another database, such as a UniProt ID, you would use <someDetail dbSource="UniProt" property="id">VAV_HUMAN</someDetail>.

The /entry/entity/segment element

The segment defines a continuous region of an entity. Two or more segments can be used to model discontinuous domains. In the mapping section, multiple segments are used where a PDB chain maps to more than one UniProt sequence (e.g. chimeras). Note that segments do not reflect disordered PDB regions, as these are continuous but not observed. Problems may arise from the PDB numbering system: a start Resnum of -1 and an end Resnum of 10 does not mean that 12 residues are involved. Thus, we strongly recommend that you cross map structural domains to MSD numbering.
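The numbering caveat above can be made concrete with a small Python sketch. The residue labels are hypothetical: we assume a chain whose PDB numbering simply has no residue 0, which is one of several ways in which PDB numbers can fail to be a consecutive run.

```python
def naive_length(start, end):
    """Length implied by treating start and end as consecutive integers."""
    return end - start + 1


# Hypothetical PDB-style residue labels for one segment. There is no
# residue 0 in this made-up chain, so the labels are not consecutive.
pdb_labels = ["-1", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

implied = naive_length(-1, 10)   # arithmetic on the start/end labels
actual = len(pdb_labels)         # residues really present in the segment
print(implied, actual)
```

Here the arithmetic suggests 12 residues while only 11 are listed, which is why a sequential numbering such as the MSD one is the safer basis for start and end.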

Format:

<segment 
	segId="an identifier for the segment" 
	start="co-ordinate system start" 
	end="co-ordinate system end">
		<listResidue>see below</listResidue>
		<listMapRegion>see below</listMapRegion>
		<segmentDetail 
			dbSource="the information source" 
			property="the property that the CDATA refers to">
				information about the segment 
		</segmentDetail>
</segment>

Element:<segment>

(required, one or more).

attribute:segId

(required) Identifier for the segment. Should be unique within the list of segments of an entity.

attributeGroup:region

The following attributes are bundled together to form a group of attributes that are capable of defining a region on something using a predefined co-ordinate system.

attribute:start

(optional). The start co-ordinate.

attribute:end

(optional). The end co-ordinate.

Element:<segmentDetail>

(optional, one or more).

attributeGroup:detail

The following attributes are bundled together to form a group of attributes that are capable of describing additional information about a particular node.

attribute:dbSource

(required). The database from which the information originates.

attribute:property

(required). A description of the CDATA. For example, to give an alternative id from another database, such as a UniProt ID, you would use <someDetail dbSource="UniProt" property="id">VAV_HUMAN</someDetail>.
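To show how the segment attributes described above might be read, here is a minimal Python sketch that lists the regions of a discontinuous domain modelled as two segments. The sample fragment, its identifiers and the function name segment_regions are assumptions for illustration; namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one domain made of two segments.
SAMPLE = """
<entity type="domain" entityId="d1">
  <segment segId="1" start="10" end="50"/>
  <segment segId="2" start="90" end="120"/>
</entity>
"""


def segment_regions(entity):
    """Return (segId, start, end) for every <segment> of an entity.

    start and end are optional in the schema, so either may be None.
    They are kept as strings because PDB-style numbering is not
    guaranteed to be purely numeric.
    """
    return [
        (seg.get("segId"), seg.get("start"), seg.get("end"))
        for seg in entity.findall("segment")
    ]


entity = ET.fromstring(SAMPLE)
regions = segment_regions(entity)
print(regions)
```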

The /entry/entity/segment/listResidue element

A container for a set of residue elements

Format:

<listResidue>
	<residue>see below</residue>
	...
	<residue>see below</residue>	
</listResidue>

Element:<listResidue>

(optional, once only if used). Encapsulates a list of residue elements.

The /entry/entity/segment/listResidue/residue element

The residue element conveys descriptive information about a single residue, together with cross mappings of that residue to other databases.

Format:

<residue 
	dbResNum="The residue number" 
	dbResName="The residue name">
		<crossRefDb 
			dbSource="the database being cross referenced" 
			dbVersion="The cross referenced database version" 
			dbCoordSys="The cross referenced database co-ordinate system" 
			dbAccessionId="The cross reference database identifier" 
			dbResNum="cross referenced residue number" 
			dbResName="cross referenced residue name" 
			dbChainId="cross referenced chain id"/>
		<residueDetail 
			dbSource="the information source" 
			property="the property that the CDATA refers to">
				information about the residue 
		</residueDetail>
</residue>

Element:<residue>

The residue element describes information about the residue.

attributeGroup:resRef

attribute:dbResNum

(required). The database residue number.

attribute:dbChainId

(optional). This is used to specify a chain id when using a PDB based co-ordinate system. e.g. dbChainId="A"

attribute:dbResName

(required). The name of the residue. In the UniProt system methionine is M, whereas in PDBresnum system it is MET.

Element:<crossRefDb>

Allows the defined residue to be cross referenced to another database.

attributeGroup:dbRef

This attribute group is used for referring to a database.

attribute:dbSource

(required). The name of the database.

attribute:dbVersion

(required). The version of the database or of the entry (e.g. PDB REVDAT). The date should be in the format YYYY-MM-DD.

attribute:dbCoordSys

(optional). The co-ordinate system used by the database. This is not always the same as the database. For example, Pfam uses UniProt as an underlying co-ordinate system.

attribute:dbAccessionId

(required). The database entry accession id. For example, SCOP uses a sunid, while Pfam uses accessions (e.g. PF01020).

attributeGroup:resRef

attribute:dbResNum

(required). The database residue number.

attribute:dbChainId

(optional). This is used to specify a chain id when using a PDB based co-ordinate system. e.g. dbChainId="A"

attribute:dbResName

(required). The name of the residue. In the UniProt system methionine is M, whereas in PDBresnum system it is MET.

Element:<residueDetail>

Describes some information about the residue. For example Not_observed or active_site.

attributeGroup:detail

The following attributes are bundled together to form a group of attributes that are capable of describing additional information about a particular node.

attribute:dbSource

(required). The database from which the information originates.

attribute:property

(required). A description of the CDATA. For example, to give an alternative id from another database, such as a UniProt ID, you would use <someDetail dbSource="UniProt" property="id">VAV_HUMAN</someDetail>.
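The residue-level cross references described above lend themselves to a simple lookup table. The Python sketch below collects, for each residue, its cross reference into a chosen database. The sample data (including the accession P00000, the version dates and the residueDetail property) and the function name residue_map are hypothetical, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two PDB residues cross referenced to UniProt.
SAMPLE = """
<listResidue>
  <residue dbResNum="1" dbResName="MET" dbChainId="A">
    <crossRefDb dbSource="UniProt" dbVersion="2005-03-10"
                dbAccessionId="P00000" dbResNum="1" dbResName="M"/>
  </residue>
  <residue dbResNum="2" dbResName="ALA" dbChainId="A">
    <crossRefDb dbSource="UniProt" dbVersion="2005-03-10"
                dbAccessionId="P00000" dbResNum="2" dbResName="A"/>
    <residueDetail dbSource="MSD" property="annotation">Not_observed</residueDetail>
  </residue>
</listResidue>
"""


def residue_map(list_residue, target_db):
    """Map (chain id, residue number) to (accession, residue number)
    in target_db, using the <crossRefDb> children of each <residue>."""
    mapping = {}
    for res in list_residue.findall("residue"):
        key = (res.get("dbChainId"), res.get("dbResNum"))
        for ref in res.findall("crossRefDb"):
            if ref.get("dbSource") == target_db:
                mapping[key] = (ref.get("dbAccessionId"), ref.get("dbResNum"))
    return mapping


mapping = residue_map(ET.fromstring(SAMPLE), "UniProt")
print(mapping)
```

Residue numbers are kept as strings, since PDB residue numbers may not be plain integers.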

The /entry/entity/segment/listMapRegion element

Allows a part of the segment to be mapped to another database.

Format:

<listMapRegion>
	<mapRegion 
		start="The start point being mapped" 
		end="The end point being mapped">
			<db dbSource="the database being mapped" 
				dbVersion="The mapped database version" 
				dbCoordSys="The mapped database co-ordinate system" 
				dbAccessionId="The mapped database identifier" 
				dbChainId="The co-ordinate system chain id" 
				start="The start point being mapped" 
				end="The end point being mapped">
				<dbDetail 
					dbSource="the information source" 
					property="the property that the CDATA refers to">
							information about the mapping or mapped database 
				</dbDetail>
			</db>
	</mapRegion>
</listMapRegion>

Element:<listMapRegion>

(optional, once only when used). The element contains a list of regions from within the segment that map to other databases.

Element:<mapRegion>

(required, one or more). Defines a region within the segment that is to be mapped to another database.

attributeGroup:region

The following attributes are bundled together to form a group of attributes that are capable of defining a region on something using a predefined co-ordinate system.

attribute:start

(optional). The start co-ordinate.

attribute:end

(optional). The end co-ordinate.

Element:<db>

(required, once only). Provides the cross reference to the defined mapRegion.

attributeGroup:dbRef

This attribute group is used for referring to a database.

attribute:dbSource

(required). The name of the database.

attribute:dbVersion

(required). The version of the database or of the entry (e.g. PDB REVDAT). The date should be in the format YYYY-MM-DD.

attribute:dbCoordSys

(optional). The co-ordinate system used by the database. This is not always the same as the database. For example, Pfam uses UniProt as an underlying co-ordinate system.

attribute:dbAccessionId

(required). The database entry accession id. For example, SCOP uses a sunid, while Pfam uses accessions (e.g. PF01020).

attribute:dbChainId

(optional). This is used to specify a chain id when using a PDB based co-ordinate system. e.g. dbChainId="A". In the case of unlabelled chains, the standard representation expected is " ".

attributeGroup:region

The following attributes are bundled together to form a group of attributes that are capable of defining a region on something using a predefined co-ordinate system.

attribute:start

(optional). The start co-ordinate.

attribute:end

(optional). The end co-ordinate.

Element:<dbDetail>

(optional, one or more). Describes some information about the database or the mapped region in that database, for example an alternative name.

attributeGroup:detail

The following attributes are bundled together to form a group of attributes that are capable of describing additional information about a particular node.

attribute:dbSource

(required). The database from which the information originates.

attribute:property

(required). A description of the CDATA. For example, to give an alternative id from another database, such as a UniProt ID, you would use <someDetail dbSource="UniProt" property="id">VAV_HUMAN</someDetail>.
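As an illustration of the mapRegion and db elements described above, the following Python sketch reads the regions of a segment and the database regions they map to. The sample models a hypothetical chimeric chain that maps to two UniProt sequences. The accessions, co-ordinates and the function name mapped_regions are made up, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one segment mapped to two UniProt sequences.
SAMPLE = """
<segment segId="1" start="1" end="100">
  <listMapRegion>
    <mapRegion start="1" end="60">
      <db dbSource="UniProt" dbVersion="2005-03-10" dbAccessionId="P00000"
          start="21" end="80"/>
    </mapRegion>
    <mapRegion start="61" end="100">
      <db dbSource="UniProt" dbVersion="2005-03-10" dbAccessionId="P11111"
          start="5" end="44"/>
    </mapRegion>
  </listMapRegion>
</segment>
"""


def mapped_regions(segment):
    """List each mapRegion with the database region it maps to."""
    results = []
    for region in segment.findall("listMapRegion/mapRegion"):
        db = region.find("db")  # required, once only per mapRegion
        results.append({
            "local": (region.get("start"), region.get("end")),
            "db": db.get("dbSource"),
            "accession": db.get("dbAccessionId"),
            "chain": db.get("dbChainId"),  # optional, None if absent
            "remote": (db.get("start"), db.get("end")),
        })
    return results


maps = mapped_regions(ET.fromstring(SAMPLE))
for m in maps:
    print(m)
```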

The /entry/alignment element

This part of the schema allows the modelling of alignments, whether they are structural or sequence based. The objects that are aligned should be defined in the entity section of the alignment. However, the schema has no built-in cross validation. Admittedly, there is some redundancy in this part of the schema, as the alignment schema is imported into both the eFamily schema and the dasalignment schema. Although slightly different methods underlie the way alignments are accessed, there is little point in reinventing the wheel to produce an alignment section. Also, once the code is written to export the alignments, it can be used to produce either dasalignments or eFamily alignments.

Format:

	<alignment>
	<alignObject dbAccessionId="someid" 
		objectVersion="version" intObjectId="internalId" 
		type="objectType" dbSource="someSource" 
		dbVersion="version" dbCoordSys="coords">
		<alignObjectDetail dbSource="someSource" 
		property="property">
			some details about the object. e.g. description, etc. 
		</alignObjectDetail> 
		<sequence>
			SEQUENCESEQUENCESEQUENCE
		</sequence>	
	</alignObject>
	<score methodName="scorename" value="scorevalue"/>
	<block blockScore="score" blockOrder="position"> 	
		<segment intObjectId="intObjectId" start="start" end="end" orientation="+">
			<cigar>9I5M10D</cigar>
		</segment>
	</block> 	
	<geo3D intObjectId="intObjectId">
		<vector x="xCoord" y="yCoord" z="zCoord"/>
		<matrix>
			<max11 coord="float"/>
			<max12 coord="float"/>
			<max13 coord="float"/>
			<max21 coord="float"/>
			<max22 coord="float"/>
			<max23 coord="float"/>
			<max31 coord="float"/>
			<max32 coord="float"/>
			<max33 coord="float"/>
		</matrix>
	</geo3D>	
	</alignment>

	

Element:<alignment>

(optional; one or more when used) Everything below belongs to an alignment.

Element:<alignObject>

(required; two or more) A description of the objects that are aligned.

attribute:objectVersion

(required) Version of the object, e.g. a CRC64 checksum for protein sequences.

attribute:intObjectId

(required) Internal, unique name for this object. It is used in the SEGMENT section to identify the object to which an alignment segment belongs.

attribute:type

(optional) A type for this object, e.g. DNA, PROTEIN, STRUCTURE, etc.

attributeGroup:dbRef

This attribute group is used for referring to a database.

attribute:dbSource

(required). The name of the database.

attribute:dbVersion

(required). The version of the database.

attribute:dbCoordSys

(optional). The co-ordinate system used by the database. This is not always the same as the database. For example, Pfam uses UniProt as an underlying co-ordinate system.

attribute:dbAccessionId

(required). The database entry accession id. For example, SCOP uses a sunid, while Pfam uses accessions (e.g. PF01020).

Element:<sequence>

(optional; one) The sequence of this object. Clients should generally use the DAS SEQUENCE request to get the sequence, so this element is optional.

Element:<alignObjectDetail>

(optional; zero or more) Details about the object.

attributeGroup:detail

The following attributes are bundled together to form a group of attributes that are capable of describing additional information about a particular node.

attribute:dbSource

(required). The database from which the information originates.

attribute:property

(required). A description of the CDATA. For example, to give an alternative id from another database, such as a UniProt ID, you would use <someDetail dbSource="UniProt" property="id">VAV_HUMAN</someDetail>.

Element:<score>

(optional; zero or more) A score for this alignment. An alignment can be described with several scores. Each score is described with the following attributes:

attribute:methodName

(required) The name of the score, e.g. number of equivalent residues (eqr), e-value, etc.

attribute:value

(required) The value of the score, e.g. 99, 10e-22, etc.

Element:<block>

(required; one or more) A block of the alignment. An alignment can consist of one or more blocks. Often there will be only one block, since the CIGAR notation for the alignment information can itself encode blocks.

attributes:

attribute:blockScore

(optional) Some kind of score for the block.

attribute:blockOrder

(required) The position of the block in the alignment.

Element:<segment>

attributeGroup:region

The following attributes are bundled together to form a group of attributes that are capable of defining a region on something using a predefined co-ordinate system.

attribute:start

(optional). The start co-ordinate.

attribute:end

(optional). The end co-ordinate.

attribute:intObjectId

(required). The internal ID of the object.

attribute:strand

(optional). The strand of the object.

Element:<cigar>

(optional). Encoding of the alignment. The "cigar" string provides an efficient way to encode an alignment. For example, 15M2D3I means that 15 residues of the sequence are Matched (aligned), followed by 2 Deletions and then 3 Insertions. Since in some situations the alignment consists of just one aligned block, the cigar string is optional. If it is missing, the alignment is ungapped and ranges from "start" to "end" of the SEGMENT.
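The cigar encoding described above is easy to decode. The following Python sketch splits a cigar string into (count, operation) pairs and totals each operation. It assumes that only the three operations named in this document (M, I and D) occur and that every operation carries an explicit count; the function names are ours.

```python
import re


def parse_cigar(cigar):
    """Split a cigar string such as '15M2D3I' into (count, op) pairs.

    Assumption: only M (match), I (insertion) and D (deletion) occur,
    each preceded by an explicit count.
    """
    if not re.fullmatch(r"(?:\d+[MID])+", cigar):
        raise ValueError(f"unsupported cigar string: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MID])", cigar)]


def count_ops(ops):
    """Total number of positions per operation."""
    totals = {"M": 0, "I": 0, "D": 0}
    for n, op in ops:
        totals[op] += n
    return totals


ops = parse_cigar("15M2D3I")
print(ops)
print(count_ops(ops))
```

A segment without a <cigar> child needs no parsing at all: as stated above, it is ungapped from start to end.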

Element:<geo3D>

(optional) Geometrical operation on a 3D object. If the objects to be aligned are three-dimensional, this section defines how one of them needs to be shifted and rotated in order to be superimposed on the others.

attribute:intObjectId

(required) The internal ID of the object.

Element:<vector>

(required, one) The shift vector, described by the x, y and z attributes.

Element:<matrix>

The "container" for the rotation matrix elements. <max11> - <max33>: the elements of the rotation matrix.
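To illustrate how the geo3D data described above might be applied, the Python sketch below reads the shift vector and rotation matrix and transforms one point. Two things are assumptions on our part, since this document does not state them: that <maxRC> holds the element in row R and column C, and that the rotation is applied before the shift. The sample values and function names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a 90 degree rotation about z plus a shift.
SAMPLE = """
<geo3D intObjectId="obj1">
  <vector x="1.0" y="2.0" z="3.0"/>
  <matrix>
    <max11 coord="0.0"/> <max12 coord="-1.0"/> <max13 coord="0.0"/>
    <max21 coord="1.0"/> <max22 coord="0.0"/> <max23 coord="0.0"/>
    <max31 coord="0.0"/> <max32 coord="0.0"/> <max33 coord="1.0"/>
  </matrix>
</geo3D>
"""


def read_geo3d(geo):
    """Return (rotation, shift) from a <geo3D> element.

    Assumption: <maxRC> is the matrix element in row R, column C.
    """
    vec = geo.find("vector")
    shift = tuple(float(vec.get(axis)) for axis in "xyz")
    matrix = geo.find("matrix")
    rotation = [
        [float(matrix.find(f"max{r}{c}").get("coord")) for c in (1, 2, 3)]
        for r in (1, 2, 3)
    ]
    return rotation, shift


def transform(point, rotation, shift):
    """Rotate a point, then shift it (assumed order of operations)."""
    rotated = [
        sum(rotation[r][c] * point[c] for c in range(3)) for r in range(3)
    ]
    return tuple(rotated[i] + shift[i] for i in range(3))


rotation, shift = read_geo3d(ET.fromstring(SAMPLE))
moved = transform((1.0, 0.0, 0.0), rotation, shift)
print(moved)
```

If a data provider uses the opposite convention (shift first, then rotate), only the transform function needs to change.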


Summary

Okay, we have walked through the schema element by element. Let's put the whole schema together. Click here to view the whole schema.

Note: If I get round to it, it would be excellent to write an image map for the schema linking back to this page......


Some Examples

A SCOP domain

A Pfam domain definition and alignment

A mapping from MSD/PDB to UniProt: 18GS, 1F5O.

Coming soon....

A CATH domain

A structural alignment.

A mapping from UniProt to MSD.