1. Ecoli database: Protein Localization Sites


1. Source:

Kenta Nakai, Institute of Molecular and Cellular Biology, Osaka University, 1-3 Yamada-oka, Suita 565, Japan. See: "A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins", Paul Horton & Kenta Nakai, Intelligent Systems in Molecular Biology, 109-115.

2. Information:

3. Attribute Information.

1. Sequence Name: Accession number for the SWISS-PROT database

2. mcg: McGeoch's method for signal sequence recognition.

3. gvh: von Heijne's method for signal sequence recognition.

4. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.

5. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.

6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.

7. alm1: score of the ALOM membrane spanning region prediction program.

8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.

9. Class Distribution. The class is the localization site:

cp  =  0 (cytoplasm)                                      143
im  = 10 (inner membrane without signal sequence)          77
pp  = 20 (periplasm)                                       52
imU = 30 (inner membrane, uncleavable signal sequence)     35
om  = 40 (outer membrane)                                  20
omL = 50 (outer membrane lipoprotein)                       5
imL = 60 (inner membrane lipoprotein)                       2
imS = 70 (inner membrane, cleavable signal sequence)        2
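
As a quick consistency check on the class table, the following Python sketch stores each label with its integer code and instance count, then confirms that the counts add up to the 336 instances the database is stated to contain. The names `CLASSES` and `class_shares` are illustrative only; they are not part of the database or of any tool.

```python
# Class label -> (integer code, instance count), copied from the class table.
CLASSES = {
    "cp": (0, 143),
    "im": (10, 77),
    "pp": (20, 52),
    "imU": (30, 35),
    "om": (40, 20),
    "omL": (50, 5),
    "imL": (60, 2),
    "imS": (70, 2),
}


def class_shares(classes):
    """Return each class's share of all instances as a fraction."""
    total = sum(count for _, count in classes.values())
    return {label: count / total for label, (_, count) in classes.items()}


total = sum(count for _, count in CLASSES.values())
shares = class_shares(CLASSES)

print("total instances:", total)
for label, share in shares.items():
    code = CLASSES[label][0]
    print(f"{label:4s} code={code:2d} share={share:6.1%}")
```

The shares make the imbalance visible: cp alone accounts for well over a third of the instances, while imL and imS together contribute only 4.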

4. Database: 336 instances.

The Ecoli database file has 336 - 10 = 326 instances, because 10 instances are set aside for the question file. The data required for 100% accuracy is:

Class        #Required    #Actual

cp  =  0        800         143
im  = 10        800          77
pp  = 20        800          52
imU = 30        800          35
om  = 40        800          20
omL = 50        800           5
imL = 60        800           2
imS = 70        800           2
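
The gap between the two columns can be made concrete with a little arithmetic. The sketch below divides #Actual by #Required for each class. The figure of 800 is taken from the table as given, and the variable names are illustrative.

```python
# Instances required per class, as stated in the table.
REQUIRED = 800

# Instances actually available per class, from the same table.
ACTUAL = {
    "cp": 143,
    "im": 77,
    "pp": 52,
    "imU": 35,
    "om": 20,
    "omL": 5,
    "imL": 2,
    "imS": 2,
}

# Fraction of the required data that is available for each class.
coverage = {label: n / REQUIRED for label, n in ACTUAL.items()}

for label, fraction in coverage.items():
    print(f"{label:4s} {ACTUAL[label]:4d} / {REQUIRED} = {fraction:6.2%}")
```

Even the best-covered class, cp, has less than a fifth of the stated requirement.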

For classes cp, im, and pp, the data might support an educated guess. As we will see, despite the severely limited number of instances provided in this case, DM 2.5 reaches an overall accuracy of 60 to 70%, which is more than expected. The DM cannot be used for the other 4 classes because this database has too little data for them (for example, only 2 instances for class imS).

Below are the first 5 instances:
 
Sequence Name   mcg    gvh    lip    chg    aac    alm1   alm2   Class

ALKH_ECOLI      0.67   0.39   0.48   0.5    0.36   0.38   0.46   cp
AMPD_ECOLI      0.29   0.28   0.48   0.5    0.44   0.23   0.34   cp
AMY2_ECOLI      0.21   0.34   0.48   0.5    0.51   0.28   0.39   cp
APT_ECOLI       0.2    0.44   0.48   0.5    0.46   0.51   0.57   cp
ARAC_ECOLI      0.42   0.4    0.48   0.5    0.56   0.18   0.3    cp
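
Each instance consists of nine fields: the sequence name, the seven numeric attributes, and the class label. A minimal sketch of turning those fields into a typed record follows. The `Instance` class and `parse_instance` function are hypothetical helpers written for this illustration, not part of DecisionMaker.

```python
from dataclasses import dataclass

# Attribute names in the order listed under "Attribute Information".
ATTRIBUTES = ("mcg", "gvh", "lip", "chg", "aac", "alm1", "alm2")


@dataclass
class Instance:
    name: str        # SWISS-PROT sequence name
    features: tuple  # seven numeric attribute values
    label: str       # localization site, e.g. "cp"


def parse_instance(fields):
    """Build an Instance from nine string fields: name, 7 values, class."""
    if len(fields) != 1 + len(ATTRIBUTES) + 1:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    name, *values, label = fields
    return Instance(name, tuple(float(v) for v in values), label)


first = parse_instance(
    "ALKH_ECOLI 0.67 0.39 0.48 0.5 0.36 0.38 0.46 cp".split()
)
print(first.name, dict(zip(ATTRIBUTES, first.features)), first.label)
```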

 

The first 5 rows (all cp-class) and the last 5 rows (all pp-class) of the original data are used for the question file. Below are the 10 instances:

Question                                                         Answer

Sequence Name   mcg    gvh    lip    chg    aac    alm1   alm2   Class

AAT_ECOLI       0.49   0.29   0.48   0.5    0.56   0.24   0.35   cp
ACEA_ECOLI      0.07   0.4    0.48   0.5    0.54   0.35   0.44   cp
ACEK_ECOLI      0.56   0.4    0.48   0.5    0.49   0.37   0.46   cp
ACKA_ECOLI      0.59   0.49   0.48   0.5    0.52   0.45   0.36   cp
ADI_ECOLI       0.23   0.32   0.48   0.5    0.55   0.25   0.35   cp
TREA_ECOLI      0.74   0.56   0.48   0.5    0.47   0.68   0.3    pp
UGPB_ECOLI      0.71   0.57   0.48   0.5    0.48   0.35   0.32   pp
USHA_ECOLI      0.61   0.6    0.48   0.5    0.44   0.39   0.38   pp
XYLF_ECOLI      0.59   0.61   0.48   0.5    0.42   0.42   0.37   pp
YTFQ_ECOLI      0.74   0.74   0.48   0.5    0.31   0.53   0.52   pp
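
The hold-out described here (first 5 and last 5 rows go to the question file, the rest stay in the database file) can be sketched as follows. The function name `split_question_rows` and its parameters are assumptions made for illustration; the original files may have been prepared by hand.

```python
def split_question_rows(rows, n_head=5, n_tail=5):
    """Split rows into (database, question).

    The question set takes the first n_head and last n_tail rows;
    everything in between stays in the database set.
    """
    if len(rows) < n_head + n_tail:
        raise ValueError("not enough rows to hold out")
    tail_start = len(rows) - n_tail
    question = rows[:n_head] + rows[tail_start:]
    database = rows[n_head:tail_start]
    return database, question


# Stand-in for the 336 original instances: row indices 0..335.
rows = list(range(336))
database, question = split_question_rows(rows)
print(len(database), len(question))
```

With 336 rows this yields the 326-instance database file and the 10-instance question file.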


5. Results

Click "Integer/+ Predict" to get the following answer file:

=================== Beginning =====================
 
 

0.49 0.29 0.48 0.5 0.56 0.24 0.35

Possibility Confidence*Probability

1 744500

------------------------------------------------------

1

0.07 0.4 0.48 0.5 0.54 0.35 0.44

Possibility Confidence*Probability

1 69000

------------------------------------------------------

1

0.56 0.4 0.48 0.5 0.49 0.37 0.46

Possibility Confidence*Probability

1 1.01896e+06

------------------------------------------------------

1

0.59 0.49 0.48 0.5 0.52 0.45 0.36

Possibility Confidence*Probability

1 384360

------------------------------------------------------

1

0.23 0.32 0.48 0.5 0.55 0.25 0.35

Possibility Confidence*Probability

1 320000

------------------------------------------------------

1

0.74 0.56 0.48 0.5 0.47 0.68 0.3

Possibility Confidence*Probability

1 30948

10 2180

20 18500

------------------------------------------------------

8

0.71 0.57 0.48 0.5 0.48 0.35 0.32

Possibility Confidence*Probability

1 77714

------------------------------------------------------

1

0.61 0.6 0.48 0.5 0.44 0.39 0.38

Possibility Confidence*Probability

1 267422

20 170000

------------------------------------------------------

8

0.59 0.61 0.48 0.5 0.42 0.42 0.37

Possibility Confidence*Probability

1 284088

------------------------------------------------------

1

0.74 0.74 0.48 0.5 0.31 0.53 0.52

Possibility Confidence*Probability

1 13876

30 1760

20 94500

------------------------------------------------------

18

=================== End ==========================
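
The integer printed under each dashed line appears to be consistent with a weighted average of the listed possibilities, using the Confidence*Probability column as weights and rounding to the nearest integer. This is an inference from the numbers in the answer file, not documented DecisionMaker behavior, and the function name `weighted_prediction` is hypothetical. The sketch uses the codes exactly as printed (the answer file shows 1 where the class table lists cp = 0).

```python
def weighted_prediction(possibilities):
    """Round the weighted mean of class codes to the nearest integer.

    possibilities: list of (class_code, confidence_times_probability).
    Assumption: this mirrors how the final integer is derived; it is
    inferred from the answer file, not taken from the tool's documentation.
    """
    total_weight = sum(weight for _, weight in possibilities)
    mean = sum(code * weight for code, weight in possibilities) / total_weight
    return round(mean)


# Blocks from the answer file that list more than one possibility.
blocks = {
    6: [(1, 30948), (10, 2180), (20, 18500)],
    8: [(1, 267422), (20, 170000)],
    10: [(1, 13876), (30, 1760), (20, 94500)],
}

for number, possibilities in blocks.items():
    print(number, weighted_prediction(possibilities))
```

Under this assumption the three multi-possibility blocks reproduce the printed values 8, 8, and 18, and any block with a single possibility of 1 trivially yields 1.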
 
 

6. Analysis

Correct: 1, 2, 3, 4, 5, 10

Incorrect: 7, 9.

Between: 6, 8.
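
The tally above can be turned into an accuracy figure. The sketch below computes a strict rate (only "Correct" counts) and a lenient rate that gives half credit to each "Between" answer. The half-credit weighting is an assumption chosen for illustration; the source does not say how partial answers are scored.

```python
# Question numbers grouped by outcome, as listed in the analysis.
correct = [1, 2, 3, 4, 5, 10]
incorrect = [7, 9]
between = [6, 8]

total = len(correct) + len(incorrect) + len(between)

# Strict accuracy: only fully correct answers count.
strict = len(correct) / total

# Lenient accuracy: assumed half credit for each "between" answer.
lenient = (len(correct) + 0.5 * len(between)) / total

print(f"strict:  {strict:.0%}")
print(f"lenient: {lenient:.0%}")
```

Under these assumptions the two rates bracket the 60 to 70% range quoted earlier.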

As noted at the beginning, this database provides a severely limited number of instances:

Class        #Required    #Actual

cp  =  0        800         143
im  = 10        800          77
pp  = 20        800          52
imU = 30        800          35
om  = 40        800          20
omL = 50        800           5
imL = 60        800           2
imS = 70        800           2

The limited data prevents the DecisionMaker from reaching 100% accuracy. Even so, the predictions are better than expected given the small number of instances.