2. Information:
1. Sequence Name: Accession number for the SWISS-PROT database
2. mcg: McGeoch's method for signal sequence recognition.
3. gvh: von Heijne's method for signal sequence recognition.
4. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
5. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
7. alm1: score of the ALOM membrane spanning region prediction program.
8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
9. Class Distribution. The class is the localization site:
Class = code (localization site) | #Instances
cp = 0 (cytoplasm) | 143
im = 10 (inner membrane without signal sequence) | 77
pp = 20 (periplasm) | 52
imU = 30 (inner membrane, uncleavable signal sequence) | 35
om = 40 (outer membrane) | 20
omL = 50 (outer membrane lipoprotein) | 5
imL = 60 (inner membrane lipoprotein) | 2
imS = 70 (inner membrane, cleavable signal sequence) | 2
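As a quick sanity check, the class counts listed above can be tallied to confirm that they add up to the 336 instances of the database. This is only an illustrative sketch; the name `class_counts` is ours and is not part of the DecisionMaker.

```python
# Class counts copied from the class distribution above.
class_counts = {
    "cp": 143, "im": 77, "pp": 52, "imU": 35,
    "om": 20, "omL": 5, "imL": 2, "imS": 2,
}

total = sum(class_counts.values())
print(total)  # 336

# Share of each class as a percentage of all instances.
for name, count in class_counts.items():
    print(f"{name:4s} {count:4d} {100 * count / total:5.1f}%")
```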
4. Database: 336 instances.
The Ecoli database file has 336 - 10 = 326 instances, because 10 instances are set aside for the question file. The data required for 100% accuracy is:
Class | #Required | #Actual
cp = 0 | 800 | 143
im = 10 | 800 | 77
pp = 20 | 800 | 52
imU = 30 | 800 | 35
om = 40 | 800 | 20
omL = 50 | 800 | 5
imL = 60 | 800 | 2
imS = 70 | 800 | 2
For classes cp, im, and pp, the data might support an educated guess. As we will see, despite the severely limited number of instances provided in this case, DM 2.5 achieves an overall accuracy of 60 - 70%, which is more than expected. The DM cannot be used for the remaining classes, because the database holds too few instances of them (for example, only 2 instances for class imS).
Below are the first 5 instances:
Sequence Name | mcg | gvh | lip | chg | aac | alm1 | alm2 | class |
ALKH_ECOLI | 0.67 | 0.39 | 0.48 | 0.5 | 0.36 | 0.38 | 0.46 | cp |
AMPD_ECOLI | 0.29 | 0.28 | 0.48 | 0.5 | 0.44 | 0.23 | 0.34 | cp |
AMY2_ECOLI | 0.21 | 0.34 | 0.48 | 0.5 | 0.51 | 0.28 | 0.39 | cp |
APT_ECOLI | 0.2 | 0.44 | 0.48 | 0.5 | 0.46 | 0.51 | 0.57 | cp |
ARAC_ECOLI | 0.42 | 0.4 | 0.48 | 0.5 | 0.56 | 0.18 | 0.3 | cp |
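For readers who want to process such instances themselves, here is a minimal sketch of parsing one row. It assumes the rows are pipe-delimited exactly as displayed above, which may not match the layout of the actual database file; `Instance` and `parse_row` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple

class Instance(NamedTuple):
    name: str              # sequence name, e.g. "ALKH_ECOLI"
    features: List[float]  # mcg, gvh, lip, chg, aac, alm1, alm2
    label: str             # localization site, e.g. "cp"

def parse_row(line: str) -> Instance:
    # Split on "|" and drop the empty field left by the trailing separator.
    fields = [f.strip() for f in line.split("|") if f.strip()]
    name, *values, label = fields
    return Instance(name, [float(v) for v in values], label)

row = parse_row("ALKH_ECOLI | 0.67 | 0.39 | 0.48 | 0.5 | 0.36 | 0.38 | 0.46 | cp |")
print(row.name, row.features, row.label)
```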
The first 5 rows (all cp-class) and the last 5 rows (all pp-class) are used for the question file. Below are the 10 instances:
Question (Sequence Name | mcg | gvh | lip | chg | aac | alm1 | alm2) | Answer (class) |
AAT_ECOLI | 0.49 | 0.29 | 0.48 | 0.5 | 0.56 | 0.24 | 0.35 | cp |
ACEA_ECOLI | 0.07 | 0.4 | 0.48 | 0.5 | 0.54 | 0.35 | 0.44 | cp |
ACEK_ECOLI | 0.56 | 0.4 | 0.48 | 0.5 | 0.49 | 0.37 | 0.46 | cp |
ACKA_ECOLI | 0.59 | 0.49 | 0.48 | 0.5 | 0.52 | 0.45 | 0.36 | cp |
ADI_ECOLI | 0.23 | 0.32 | 0.48 | 0.5 | 0.55 | 0.25 | 0.35 | cp |
TREA_ECOLI | 0.74 | 0.56 | 0.48 | 0.5 | 0.47 | 0.68 | 0.3 | pp |
UGPB_ECOLI | 0.71 | 0.57 | 0.48 | 0.5 | 0.48 | 0.35 | 0.32 | pp |
USHA_ECOLI | 0.61 | 0.6 | 0.48 | 0.5 | 0.44 | 0.39 | 0.38 | pp |
XYLF_ECOLI | 0.59 | 0.61 | 0.48 | 0.5 | 0.42 | 0.42 | 0.37 | pp |
YTFQ_ECOLI | 0.74 | 0.74 | 0.48 | 0.5 | 0.31 | 0.53 | 0.52 | pp |
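The hold-out described above (first 5 and last 5 rows go to the question file, the rest stay in the database file) could be sketched as follows. The function name `split_question_rows` is hypothetical, and the list of integers is only a stand-in for the 336 real instances.

```python
def split_question_rows(rows, n=5):
    """Return (question_rows, remaining_rows); the first n and last n rows are held out."""
    if n < 1 or len(rows) < 2 * n:
        raise ValueError("need n >= 1 and at least 2*n rows")
    question = rows[:n] + rows[-n:]
    remaining = rows[n:-n]
    return question, remaining

rows = list(range(336))  # stand-in for the 336 instances
question, remaining = split_question_rows(rows)
print(len(question), len(remaining))  # 10 326
```

The two lengths match the 336 - 10 = 326 split stated for the database file.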
5. Results
Click "Integer/+ Predict" to get the following answer file:
=================== Beginning =====================
0.49 0.29 0.48 0.5 0.56 0.24 0.35
Possibility Confidence*Probability
1 744500
------------------------------------------------------
1
0.07 0.4 0.48 0.5 0.54 0.35 0.44
Possibility Confidence*Probability
1 69000
------------------------------------------------------
1
0.56 0.4 0.48 0.5 0.49 0.37 0.46
Possibility Confidence*Probability
1 1.01896e+06
------------------------------------------------------
1
0.59 0.49 0.48 0.5 0.52 0.45 0.36
Possibility Confidence*Probability
1 384360
------------------------------------------------------
1
0.23 0.32 0.48 0.5 0.55 0.25 0.35
Possibility Confidence*Probability
1 320000
------------------------------------------------------
1
0.74 0.56 0.48 0.5 0.47 0.68 0.3
Possibility Confidence*Probability
1 30948
10 2180
20 18500
------------------------------------------------------
8
0.71 0.57 0.48 0.5 0.48 0.35 0.32
Possibility Confidence*Probability
1 77714
------------------------------------------------------
1
0.61 0.6 0.48 0.5 0.44 0.39 0.38
Possibility Confidence*Probability
1 267422
20 170000
------------------------------------------------------
8
0.59 0.61 0.48 0.5 0.42 0.42 0.37
Possibility Confidence*Probability
1 284088
------------------------------------------------------
1
0.74 0.74 0.48 0.5 0.31 0.53 0.52
Possibility Confidence*Probability
1 13876
30 1760
20 94500
------------------------------------------------------
18
=================== End ==========================
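The source does not say how the integer below each dashed line is derived. One reading that is consistent with all ten answers is that it is the mean of the listed possibilities, weighted by their Confidence*Probability scores and rounded to the nearest integer. The sketch below checks that reading on the three answers with more than one possibility; it is our interpretation, not a documented DecisionMaker formula, and `weighted_answer` is a hypothetical name.

```python
def weighted_answer(possibilities):
    """Round the score-weighted mean of the candidate class codes."""
    total = sum(score for _, score in possibilities)
    mean = sum(code * score for code, score in possibilities) / total
    return int(mean + 0.5)  # round half up; the codes here are positive

# (possibility, Confidence*Probability) pairs copied from the answer file.
sixth = [(1, 30948), (10, 2180), (20, 18500)]
eighth = [(1, 267422), (20, 170000)]
tenth = [(1, 13876), (30, 1760), (20, 94500)]

print(weighted_answer(sixth), weighted_answer(eighth), weighted_answer(tenth))  # 8 8 18
```

Answers with a single possibility trivially give back that possibility, which matches the remaining seven entries.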
6. Analysis
Correct: 1, 2, 3, 4, 5, 10
Incorrect: 7, 9.
Between: 6, 8.
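The tally above can be turned into an accuracy figure in more than one way. The sketch below computes a strict rate (only "Correct" counts) and a lenient rate that gives each "Between" answer half credit; the half-credit weighting is our assumption, not something the source specifies.

```python
correct = [1, 2, 3, 4, 5, 10]
incorrect = [7, 9]
between = [6, 8]

n = len(correct) + len(incorrect) + len(between)
strict = len(correct) / n                           # "between" counted as wrong
lenient = (len(correct) + 0.5 * len(between)) / n   # "between" given half credit (assumption)
print(f"{strict:.0%} {lenient:.0%}")  # 60% 70%
```

Under these assumptions the two figures bracket the 60 - 70% overall accuracy quoted earlier.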
As indicated at the beginning, this database provides a severely limited number of instances:
Class | #Required | #Actual
cp = 0 | 800 | 143
im = 10 | 800 | 77
pp = 20 | 800 | 52
imU = 30 | 800 | 35
om = 40 | 800 | 20
omL = 50 | 800 | 5
imL = 60 | 800 | 2
imS = 70 | 800 | 2
The data limits the DecisionMaker's ability to provide 100% accuracy.
Given the limited amount of data, the predictions are better than expected.