Not everyone wants to write an XML parser from scratch, so here are some one-line commands that parse the XML file and print the result as tab-separated text.
You can save the output and open it with Microsoft Excel, or use command-line utilities such as grep and awk for further data mining.
You may fail to find the dataset you want because you searched with an alias. For example, searching for a factor with JMJ returns nothing, because we use Jarid2 as the name of that factor.
In that situation, download the whole vocabulary to find the standard name of the factor you're looking for, then search the database with that name.
The following commands parse the XML file into a vocabulary table with two columns, ID and Name:
Get the whole Factor vocabulary:
curl -s http://cistrome.org/cisapi/get/f | awk -F '[<>]' '$2 ~ /id/{printf "%s\t",$3} $2 ~ /name/{print $3}'
Get the whole CellLine vocabulary:
curl -s http://cistrome.org/cisapi/get/cl | awk -F '[<>]' '$2 ~ /id/{printf "%s\t",$3} $2 ~ /name/{print $3}'
Get the whole CellPop vocabulary:
curl -s http://cistrome.org/cisapi/get/cp | awk -F '[<>]' '$2 ~ /id/{printf "%s\t",$3} $2 ~ /name/{print $3}'
Get the whole CellType vocabulary:
curl -s http://cistrome.org/cisapi/get/ct | awk -F '[<>]' '$2 ~ /id/{printf "%s\t",$3} $2 ~ /name/{print $3}'
Get the whole TissueType vocabulary:
curl -s http://cistrome.org/cisapi/get/t | awk -F '[<>]' '$2 ~ /id/{printf "%s\t",$3} $2 ~ /name/{print $3}'
Get the whole DiseaseState vocabulary:
curl -s http://cistrome.org/cisapi/get/ds | awk -F '[<>]' '$2 ~ /id/{printf "%s\t",$3} $2 ~ /name/{print $3}'
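To see how this kind of awk filter turns XML into a two-column table, here is a minimal sketch that runs offline on a made-up fragment. The tag layout (one <id> and one <name> element per line) is an assumption inferred from the commands in this section, not a copy of the real response. The filter tests only the tag field ($2), so a value such as Jarid2, which itself contains the letters "id", is printed once rather than twice.

```shell
# Hypothetical XML fragment; the real response may be laid out differently.
sample_xml='<factors>
  <factor>
    <id>1</id>
    <name>Jarid2</name>
  </factor>
  <factor>
    <id>2</id>
    <name>CTCF</name>
  </factor>
</factors>'

# Splitting on < or > makes $2 the tag name and $3 the text between the tags.
# An id line prints its value plus a tab; a name line prints its value plus a newline.
vocab=$(printf '%s\n' "$sample_xml" |
  awk -F '[<>]' '$2 ~ /id/{printf "%s\t",$3} $2 ~ /name/{print $3}')

printf '%s\n' "$vocab"
```

Each output row holds the ID, a tab, and the Name, which is the layout Excel, grep, and awk can consume directly.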
To fetch all six vocabularies with a single curl call:
meta_data=$(for i in f cl cp ct t ds; do printf 'http://cistrome.org/cisapi/get/%s\n' "$i"; done)
curl -s $meta_data | awk -F '[<>]' '$2 ~ /id/{printf "%s\t",$3} $2 ~ /name/{print $3}' > all_vocabulary
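Once a vocabulary table has been saved, looking up the standard name is a matter of searching the Name column with part of the name you expect. The sketch below is only an illustration: the table contents are invented, and the search string jarid is an arbitrary example.

```shell
# Hypothetical vocabulary table (ID<TAB>Name); real IDs and names will differ.
vocab_file=$(mktemp)
printf '1\tJarid2\n2\tCTCF\n3\tEZH2\n' > "$vocab_file"

# Case-insensitive substring match restricted to the Name column ($2),
# so that numeric IDs in column 1 never produce false hits.
hits=$(awk -F '\t' -v pat='jarid' 'tolower($2) ~ tolower(pat)' "$vocab_file")

printf '%s\n' "$hits"
rm -f "$vocab_file"
```

The matching row gives the standard name to use in the database search.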
The following commands parse all Sample data with the prefix ESM, SRX, or CSM into one big table, one row per sample:
curl -s "http://cistrome.org/cisapi/get/s2?uid=esm" "http://cistrome.org/cisapi/get/s2?uid=srx" "http://cistrome.org/cisapi/get/s2?uid=csm" > todo
awk -F '[<>]' '{if (/factor/) printf ("\n%s\t",$3); else if (!/sample/) printf ("%s\t",$3)}END{printf ("\n")}' todo
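The sample filter works differently from the vocabulary one: a line containing "factor" starts a new row, lines containing "sample" are skipped, and every other line appends its value to the current row. A minimal offline sketch follows. The tag names cell_line and unique_id and all values are hypothetical placeholders, and the real records contain more fields.

```shell
# Hypothetical sample records; tag names and values are placeholders.
sample_xml='<samples>
  <sample>
    <factor>Jarid2</factor>
    <cell_line>cellA</cell_line>
    <unique_id>GSM000001</unique_id>
  </sample>
  <sample>
    <factor>CTCF</factor>
    <cell_line>cellB</cell_line>
    <unique_id>GSM000002</unique_id>
  </sample>
</samples>'

# Same awk program as above; grep -v drops the blank line that the
# filter prints before the first record.
table=$(printf '%s\n' "$sample_xml" |
  awk -F '[<>]' '{if (/factor/) printf ("\n%s\t",$3); else if (!/sample/) printf ("%s\t",$3)}END{printf ("\n")}' |
  grep -v '^$')

printf '%s\n' "$table"
```

Note that every row ends with a trailing tab, and that the factor must be the first field of each record for the rows to line up.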
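Save the following lines as a script and pass the factor name as the first parameter. The script keeps field 11 of each row, retains only the lines containing GSM, and writes them to a file named after the factor with a .tab suffix: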
curl -s "http://cistrome.org/cisapi/get/s2?fname=$1" | \
awk -F '[<>]' '{if (/factor/) printf ("\n%s\t",$3); else if (!/sample/) printf ("%s\t",$3)}END{printf ("\n")}' | \
cut -f 11 | \
grep GSM > "$1.tab"
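The last two stages of the script can be tried without network access. The sketch below feeds them two invented rows of 11 tab-separated fields. The field position 11 is taken from the script itself, while the placeholder values c2 to c10 and both accession numbers are assumptions made for illustration.

```shell
# Two hypothetical rows with 11 fields each; only field 11 matters here.
rows=$(printf 'Jarid2\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tGSM000001\t\nJarid2\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tSRX000002\t\n')

# cut keeps field 11 of every row; grep then keeps only GSM entries.
ids=$(printf '%s\n' "$rows" | cut -f 11 | grep GSM)

printf '%s\n' "$ids"
```

If the field order of the table ever changes, the number passed to cut -f has to be adjusted to match.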