ICM Language Reference : Parsing XML example: DrugBank.

The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6826 drug entries including 1431 FDA-approved small molecule drugs, 133 FDA-approved biotech (protein/peptide) drugs, 83 nutraceuticals and 5211 experimental drugs. Additionally, 4435 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data. Read more information: here

The most complete drug information (target, transporter, carrier, and enzyme information ) is provided in XML format. Chemical structures are provided separately in SDF format

The following example will demonstrate how to deal with such data in ICM.

Read the XML data directly from the website
```
read xml "http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip" name="drugbank"
```
The command above will create collection object "drugbank".
Examine the content
```
icm/def> Name( drugbank )
#>S string_array
drugs
```
This shows us that collection contains a single root node called "drugs"

Going further gives the following:

icm/def> Name( drugbank["drugs"] )
#>S string_array
drug
partners
xmlns
xmlns:xs
xs:schemaLocation
icm/def> Type(  drugbank["drugs","drug"] )
array
icm/def> Type(  drugbank["drugs","partners"] )
collection
icm/def> Name(  drugbank["drugs","partners"] )
#>S string_array
partner
icm/def> Type(  drugbank["drugs","partners","partner"] )
array

Which means that drugbank["drugs","drug"] is an array where each entry contains the information about particular drug. In addition there is an another array drugbank["drugs","partners","partner"] which contains an additional information about targets.

Examine individual entries
```
drugbank["drugs","drug"][1]
drugbank["drugs","drug"][2]
drugbank["drugs","partners","partner"][1]
drugbank["drugs","partners","partner"][2]
```
The default output format for displaying collection is JSON which gives you nicely formated easy-to-read text. Looking at the output it's easy find the fields of interest.
WARNING: do not try to show the entire array into the terminal window because it'll take very long and most likely you'll need to kill the window.
Fetching individual fields
Let's create a table with a single column containing an array with drug cards.
```
add column drugs drugbank["drugbank","drug"]
```
Hint: In GUI you can resize all simultaneously by holding 'CTRL' key which resizing an individual row.
The single field can be extracted by providing dot separated path to it. Note that fields which contain non-alphanumeric characters must be quoted.
- A.drugbank - OK
- A.'drugbank-id' - must be quoted
```
# extracts drugbank-id into separate column
add column drugs function="A.'drugbank-id'[1]['$']" name="drugbank_id"

# extracts name into separate column
add column drugs function="A.name" name="name"
```
Fetching multi-value fields
Multiple properties will be extracted as an array for each drug entry.
```
# display targets information for the second entry
drugs.A[2]["targets","target"]
# extract array of partner IDs for each drug into separate column
add column drugs function= "A.targets.target.partner" name="partner_id"  
Type( drugs.partner_id[2] )  # array
```
This way to extract multiple properties has one problem. For entries with only one property the result will be not array but rather individual value (E.g: Type(Type( drugs.partner_id[1] ). This will prevent from the unified access to the column in the future. In such cases it's recommended to use ':' operation instead of '.'. The result of this operation will always be an array (even for single entries).
```
delete drugs.partner_id
add column drugs function="A.targets.target:partner" name="partner_id"  # will create an array for all entries.
Type( drugs.partner_id[1] )  # array (even for single entries)
```
Querying XML fields
Let's say you want to extract a value of the property with name which start with "logP". It can be done similar to the ICM-table filtering operations. The only difference is that colon ':' (instead of dot) must be used to separate field name
The general filtering syntax:
```
<field1>.<field2>:<queryField> <op> <value>
```
The following operations are supported in array filtering: ==,!=,>,<,>=,<=,~,!~
Example:
```
# query and extract logP property
add column drugs function="(A.'experimental-properties'.property:kind ~ '^logP').value[1]" name="logP"
```
Note that some entries contain text information ('0.61 [HANSCH,C ET AL. (1995)]') so the result column will not be automatically converted to rarray. You can convert it explicitly:
```
# empty or 'bad' entries will be marked as 'ND'
add column drugs Rarray( drugs.logP ) name="logPNum" 
delete drugs.logP
```
The other example will extract Wikipedia links:
```
add column drugs \
  function="(A.'external-links'.'external-link':resource == 'RxList')[1].url"\
  name    ="rxlist"
```

Joining with information from drugbank["drugs","partners","partner"]

For each drug entry we have list of partner IDs which refers to information from drugbank["drugs","partners","partner"] array. To join them we need to add this array to the other table and extract fields which will be used in join.

# creates a table and put partner entries there.
add column partners drugbank["drugs","partners","partner"] 
# extract ID column which will be used to join with drugs.partner_id
add column partners function= "A.id" name="id"   
# extract uniprot-id from the "external-identifiers" array using query functions
add column partners \
 function= '(A."external-identifiers"."external-identifier":resource ~ "UniProtKB")."identifier"[1]' \
 name    = "uniprot_id"

Finally we need to join drugs.partner_id with partners.id.

join drugs.partner_id partners.id column ="drugs.*,partners.uniprot_id" name="drugs"

Note that since drugs.partner_id contains multiple entries for each row the result drugs.uniprot_id will also contain multiple entries for each row. You can set special format with set format command to execute a special action when particular uniprot entry is clicked.

# load sequence 
set format drugs.uniprot_id \
 "<!--icmscript name=\"1\"\nread sequence swiss \"http://www.uniprot.org/uniprot/%1.txt\"\n--><a href=#_>%1</a>" 
# or simply go to the website
set format drugs.uniprot_id "<a href=http://www.expasy.org/uniprot/%1>%1</a>"

Joining with chemical structures The final step would be to add a chemical structure information.

# read SDF from the website
read table mol "http://www.drugbank.ca/system/downloads/current/structures/all.sdf.zip" name="drugs_chem"
# join 'mol' column
join drugs.drugbank_id drugs_chem.DRUGBANK_ID  column="drugs.*,drugs_chem.mol" name="drugs"

A little bit more rearrangements and your table is ready to be exported to SDF file.

move drugs.mol 1  # move structure column to the first position
delete drugs.A    # delete drug-card information
delete drugs.partner_id # delete partner id information
write table mol drugs "mydrugs.sdf"