Rome University, La Sapienza, Chemistry Department, Rome, Italy, Europe
Dr. Giovanni Visco, Oct 2007. Chemometrics, multivariate analysis, clustering, pattern recognition, exploratory data analysis ...
Degree course in: Sciences Applied to Cultural Heritage and to Diagnostics for their Conservation. Degree course in: Environmental Chemistry
the Data Set
Well-established chemometrics (or chemometry) data sets
Data set, definition
Whether we are looking for an anomalous datum in a long list of measurements, checking whether a coin belongs to the group we assume to be authentic, trying to remove a spike from a brightness measurement taken in the library, testing whether version 3.0 of the software written by your friend Mario works well, or finally checking our own ability to use statistical procedures, we need reliable, well-known data, studied and re-studied by many scientists, whose problems are known: a data set.
Data set, what's inside?
Before using a data set, or before downloading a file of measurements, we must know something about it.
- A typical dataset:
- Contents: the number of objects, variables and parameters available in total
- Source: who made the measurements, where they were published, the dates, and a reference for finding or citing them
- Description: a description of the variables and of the objects, how and where the measurements were made, and any surrounding information that helps to frame the problem. Today a description of the instrument used for the measurements, its accuracy, and other parameters would also be necessary
- Variables: each variable in the dataset must be described one by one, stating its unit of measurement and any numerical conversions if necessary
- Clas-var: description of the variables that relate the objects to one another, for instance a time scale, a progressive number, etc.
- Classes: description of the classes already present in the data set (sick/healthy, etc.)
- Main question: what is the principal question posed by this dataset (with only three variables can we tell the sick from the healthy, see above), which mathematical method best achieves a separation, which point is anomalous among the others, etc.
- The files: the files available for your studies, in formats that one hopes will not quickly fall into oblivion (do not provide a QuattroPro file; a .csv is better)
- Obtained: a careful description of where these data were taken from: a printed copy, a disc provided by the author, etc.
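The items listed above can be kept together in a small record, so that every data set carries its own description. The following is only a minimal sketch: the class name `DataSetCard`, its field names and the `problems` check are hypothetical and are not part of the course material.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetCard:
    """Descriptive record for one data set (hypothetical structure)."""
    name: str
    objects: int                 # Contents: number of objects (rows)
    columns: int                 # Contents: total columns in the file
    variables: int               # Contents: number of measured variables
    source: str                  # Source: who measured, where published
    description: str = ""
    variable_names: list = field(default_factory=list)
    clas_vars: list = field(default_factory=list)
    classes: list = field(default_factory=list)
    main_question: str = ""
    files: list = field(default_factory=list)
    obtained: str = ""

    def problems(self):
        """Return inconsistencies between the declared counts and the names."""
        found = []
        if len(self.variable_names) != self.variables:
            found.append("variable count does not match the variable names")
        if self.variables + len(self.clas_vars) > self.columns:
            found.append("more variables and clas-vars than columns")
        return found


# Values taken from the Iris entry of this page.
iris = DataSetCard(
    name="Iris", objects=150, columns=5, variables=4,
    source="E. Anderson (1935)",
    variable_names=["sepal-length", "sepal-width",
                    "petal-length", "petal-width"],
    clas_vars=["category"],
    classes=["Iris setosa canadensis", "Iris versicolor", "Iris virginica"],
    files=["iris.wks", "iris.csv"],
)
print(iris.problems())
```

An empty list means the declared counts agree with the listed names; it says nothing about the correctness of the measurements themselves.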
A remark must be made about the mistakes present in a dataset: almost every one contains one or more numerical mistakes arising from a typing error, a wrong measurement, two swapped values, etc.
We have preferred to reproduce the original data without any correction. Sometimes the point of the study is precisely to find the statistical tool that highlights these mistakes.
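As one possible example of a tool that highlights such mistakes, the sketch below flags values by their modified z-score (based on the median and the median absolute deviation). The choice of this statistic, the threshold 3.5 and the function name `flag_suspects` are assumptions made here for illustration; the sample values are invented and do not come from any of the data sets.

```python
from statistics import median


def flag_suspects(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread: the score is undefined, flag nothing
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Invented values: 51 looks like 5.1 typed without the decimal point.
sample = [5.1, 4.9, 5.0, 51, 5.2, 4.8, 5.0]
print(flag_suspects(sample))  # prints [3]
```

A flagged value is only a candidate: whether it is a typing mistake or a genuine extreme measurement still has to be decided by looking at the source.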
Famous data sets
Starting from the original papers collected in my library: list, descriptions, and files in Lotus .WKS format and .CSV type 1 format:
- Tips:
- Contents: 244 Objects, 8 Columns, 7 Variables
- Source: P.G. Bryant, M.A. Smith, Practical Data Analysis: Case Studies in Business Statistics, Richard D. Irwin Publishing, Homewood, IL (1995)
- Description: food servers' tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990
- Variables: (total bill, cost of the meal, including tax, in US dollars)=totbill, (tip, gratuity, in US dollars)=tip, (sex of person paying for the meal, 0=male, 1=female)=sex, (smoker in party? 0=No, 1=Yes)=smoker, (day of the week, 3=Thur, 4=Fri, 5=Sat, 6=Sun)=day, (time of day, 0=Day, 1=Night)=time, (size of the party)=size
- Clas-var: (observation number)=obs
- Classes: none already present
- Main question: what are the factors that affect tipping behaviour?
- The files: tips.wks and tips.csv
- Obtained: from the original publication of Bryant et al.
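The numeric codes in the Tips variables can be translated back into labels before plotting. The sketch below uses the codes given in the Variables line; the function name `decode`, the derived `tip_pct` column and the example row are assumptions added for illustration (the row is invented, not taken from tips.csv).

```python
# Code tables copied from the Variables description of the Tips data set.
SEX = {0: "male", 1: "female"}
SMOKER = {0: "No", 1: "Yes"}
DAY = {3: "Thur", 4: "Fri", 5: "Sat", 6: "Sun"}
TIME = {0: "Day", 1: "Night"}


def decode(row):
    """Return a copy of one row with coded columns replaced by labels."""
    out = dict(row)
    out["sex"] = SEX[row["sex"]]
    out["smoker"] = SMOKER[row["smoker"]]
    out["day"] = DAY[row["day"]]
    out["time"] = TIME[row["time"]]
    # Derived variable (an assumption, not in the file): tip as % of the bill.
    out["tip_pct"] = 100.0 * row["tip"] / row["totbill"]
    return out


# Invented row, for illustration only.
example = {"obs": 1, "totbill": 20.0, "tip": 3.0, "sex": 1,
           "smoker": 0, "day": 6, "time": 1, "size": 2}
print(decode(example))
```

A code outside the tables raises a `KeyError`, which is a cheap way to catch typing mistakes in the coded columns.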
- Iris:
- Contents: 150 Objects, 5 Columns, 4 Variables
- Source: E. Anderson, The irises of the Gaspé peninsula, Bulletin of the American Iris Society, 59 (1935) 2-5
- Description: the Iris flowers form a whole family; this data set covers only three of the possible varieties. It is perhaps the most famous data set, known for its intrinsic simplicity but also for the difficulty of finding a statistical method that cleanly separates the three classes. Unfortunately there are MANY faulty copies of the iris data set in the wild! Note also R.A. Fisher's work on the iris data set.
- Variables: (sepal length, mm)=sepal-length, (sepal width, mm)=sepal-width, (petal length, mm)=petal-length, (petal width, mm)=petal-width
- Clas-var: (class name)=category
- Classes: 50 Iris setosa canadensis, 50 Iris versicolor, 50 Iris virginica
- Main question: find a class separation method (and also find possible measurement errors)
- The files: iris.wks and iris.csv also: the iris flower, the setosa, the versicolor, the virginica.
- Obtained: first from the original paper of E. Anderson, then checked for errata and compared with the R.A. Fisher paper
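Since many faulty copies of the iris data circulate, a downloaded copy can at least be checked against the counts stated above (150 objects, 4 variables, 3 classes of 50). The sketch below assumes each row is a tuple of four numbers followed by the class name; the function name `check_iris_copy` is hypothetical, and the rows in the usage lines are placeholders, not real measurements.

```python
from collections import Counter


def check_iris_copy(rows):
    """Return a list of problems found in a copy of the iris data."""
    problems = []
    if len(rows) != 150:
        problems.append(f"expected 150 objects, found {len(rows)}")
    counts = Counter(r[-1] for r in rows)
    if len(counts) != 3:
        problems.append(f"expected 3 classes, found {len(counts)}")
    for name, n in sorted(counts.items()):
        if n != 50:
            problems.append(f"class {name!r} has {n} objects, expected 50")
    for i, r in enumerate(rows):
        if len(r) != 5:
            problems.append(f"row {i} has {len(r)} columns, expected 5")
            continue
        for v in r[:4]:
            if not isinstance(v, (int, float)) or v <= 0:
                problems.append(f"row {i} has a non-positive or non-numeric value")
                break
    return problems


# Placeholder rows with the right shape (the values are NOT real data).
names = ["Iris setosa canadensis", "Iris versicolor", "Iris virginica"]
placeholder = [(1.0, 1.0, 1.0, 1.0, n) for n in names for _ in range(50)]
print(check_iris_copy(placeholder))       # prints []
print(check_iris_copy(placeholder[:-1]))  # reports the missing object
```

This only checks the shape of the copy; finding the individual wrong values still requires a comparison with the original papers.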
- Cars:
- Contents: 406 Objects, 9 Columns, 8 Variables
- Source: D. Donoho, E. Ramos, "Primdata: Data Sets for Use With PRIM-H", (1982). Version for the second Exposition of Statistical Graphics Technology (15-18 Aug. 1983), by the American Statistical Association
- Description: quoting from the presentation of the second Exposition of Statistical Graphics Technology, Toronto, 15-18 Aug. 1983 ... "The purposes of the Exposition are to provide a forum in which users and providers of statistical graphics technology can exchange information and ideas .... A fixed data set is to be analyzed. You are asked to analyze these data using your statistical graphics software .... This data set is a version of the CRCARS data set of ...."
- Variables: (car name, model)=name, (miles per gallon)=mpg, (engine cylinders)=cylinders, (engine displacement, cubic inches)=displacement, (engine power)=horsepower, (vehicle weight, lbs.)=weight, (time to accelerate from 0 to 60 mph, sec.)=acceleration, (model year, 19xx)=model.year
- Clas-var: (origin of car, 1=American, 2=European, 3=Japanese)=origin
- Classes: none already present
- Main question: your objective should be to achieve graphical displays which will be meaningful to the viewers and highlight relevant aspects of the data. The role of each presenter is to do his/her best job of presenting their statistical graphics technology to the viewers. As an example, you can look for differences between diesel and gasoline cars, etc.
- The files: cars406.wks and cars406.csv also cars.desc.txt
- Obtained: from the old meeting web site, cars.data.txt
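The Cars variables are in US customary units, so a numerical conversion may be needed before comparing with European data. The sketch below assumes US gallons for mpg and cubic inches for the displacement; the function name `to_metric` and the example values are illustrative only and are not taken from cars406.csv.

```python
# Exact or standard conversion factors.
MPG_TO_L_PER_100KM = 235.214583  # L/100 km = 235.214583 / (miles per US gallon)
CUBIC_INCH_TO_CM3 = 16.387064    # 1 in^3 = 16.387064 cm^3
LB_TO_KG = 0.45359237            # 1 lb = 0.45359237 kg


def to_metric(mpg, displacement_in3, weight_lb):
    """Convert one car's mpg, displacement and weight to metric units."""
    return {
        "l_per_100km": MPG_TO_L_PER_100KM / mpg,
        "displacement_cm3": displacement_in3 * CUBIC_INCH_TO_CM3,
        "weight_kg": weight_lb * LB_TO_KG,
    }


# Invented car, for illustration only.
print(to_metric(mpg=20.0, displacement_in3=100.0, weight_lb=2000.0))
```

Note that fuel consumption in L/100 km is inversely proportional to mpg, so the conversion also changes the shape of the distribution, not only its scale.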
- Crabs:
- Contents: 200 Objects, 8 Columns, 5 Variables
- Source: N.A. Campbell, R.J. Mahon, A Multivariate Study of Variation in Two Species of Rock Crab of genus Leptograpsus, Australian Journal of Zoology 22 (1974) 417-425
- Description: measurements on rock crabs of the genus Leptograpsus. The species Leptograpsus variegatus had been split into two new species, previously grouped by colour, orange and blue. Sub-classes can be male and female. Museum specimens lose their colour, so it was hoped that morphological differences would allow them to be classified
- Variables: (carapace frontal lip, mm)=FL, (carapace rear width, mm)=RW, (length of midline of the carapace, mm)=CL, (maximum width of carapace, mm)=CW, (body depth, mm)=BD
- Clas-var: (index number within sex/species)=index, (2 crab species)=species, (crab sex)=sex
- Classes: 50 lept. varieg. Blue Male, 50 lept. varieg. Blue Female, 50 lept. varieg. Orange Male, 50 lept. varieg. Orange Female
- Main question: in a museum can we determine the species and sex of the crabs based on these five morphological measurements?
- The files: australian-crabs.wks and australian-crabs.csv also crab-carapace.png
- Obtained: I asked a colleague, whose university holds the original journal, to scan the original paper
- Wine:
- Contents: yyy Objects, xy Columns, yy Variables
- Source: M. Forina,
- Description:
- Variables: (yxyx)=yxy
- Clas-var: yxyxyx
- Classes: yxyx
- Main question: yxyxyx
- The files: yxyx.wk1 and yxyx.csv
- Obtained: from yxyx
- Olive Oils:
- Contents: 572 Objects, 12 Columns, 8 Variables
- Source: M. Forina, C. Armanino, S. Lanteri, E. Tiscornia, Classification of Olive Oils from their Fatty Acid Composition, in H. Martens, H. Russwurm Jr. (Eds.), Food Research and Data Analysis, Applied Science Pub., London, (1983) 189-214
- Description: These data consist of the percentage composition of fatty acids found in the lipid fraction of Italian olive oils. The data arise from a study to determine the authenticity of an olive oil
- Variables: (palmitic acid %)=palmitic, (palmitoleic acid %)=palmitoleic, (stearic acid %)=stearic, (oleic acid %)=oleic, (linoleic acid %)=linoleic, (linolenic acid %)=linolenic, (arachidic acid %)=arachidic, (eicosenoic acid %)=eicosenoic
- Clas-var: (counter)=num, (super-classes)=region, (collection areas: 3 from the North {Umbria, East and West Liguria}, 4 from the South {North and South Apulia, Calabria, Sicily}, 2 from Sardinia {inland and coastal Sardinia})=area, (provenance)=classes
- Classes: 3 "super-classes" of Italy (North, South, Sardinia island) and 9 collection areas
- Main question: Find a statistical method able to distinguish the oils from different regions and areas in Italy based on their combinations of fatty acids
- The files: olive-oils.wks and olive-oils.csv also olive-oils.png
- Obtained: from a personal gift by one of the authors
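The two-level class structure of the Olive Oils data (9 collection areas nested in 3 super-classes) can be written down as a lookup table. The sketch below follows the Clas-var description; the label spellings and the name `region_of` are assumptions, since the actual file may use numeric codes or different names.

```python
# Areas grouped by super-class, as listed in the Clas-var description.
AREAS_BY_REGION = {
    "North": ["Umbria", "East Liguria", "West Liguria"],
    "South": ["North Apulia", "South Apulia", "Calabria", "Sicily"],
    "Sardinia": ["Inland Sardinia", "Coastal Sardinia"],
}

# Inverted table: collection area -> super-class.
REGION_OF_AREA = {area: region
                  for region, areas in AREAS_BY_REGION.items()
                  for area in areas}


def region_of(area):
    """Return the super-class (region) of a collection area."""
    return REGION_OF_AREA[area]


print(len(REGION_OF_AREA))    # prints 9
print(region_of("Calabria"))  # prints South
```

With this table a classifier can be evaluated at both levels: a sample assigned to the wrong area but the right region is a milder error than one assigned to the wrong region.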
- 10eurocents:
- Contents: yyy Objects, xy Columns, yy Variables
- Source:
- Description:
- Variables: (yxyx)=yxy
- Clas-var: (xyxyx)=xyxyx
- Classes: yxyx
- Main question: yxyxyx
- The files: yxyx.wk1 and yxyx.csv (lotus .WKS format & .CSV type 1 format)
- Obtained: from yxyx
- Cookie NIR (biscuit):
- Contents: this is a transposed matrix with 707 rows (the variables) and 40 columns (the objects)
- Source: B.G. Osborne, T. Fearn, A.R. Miller, S. Douglas, Application of Near Infrared Reflectance Spectroscopy to the Compositional Analysis of Biscuits and Biscuit Dough, J. Sci. Food Agric. 35 (1984) 99-105.
- Description: The experiment involved varying the composition of biscuit dough pieces. Two sets of dough pieces were measured, a calibration set and a prediction set, with NIR over the spectral range 1100-2498 nm in steps of 2 nm. They were created and measured as two distinct sets, on separate occasions, and do not result from a random (or any other) split of a larger set.
- Variables: the variables are in rows: fat, sucrose, flour and water, all in percent. Row 5 selects the calibration or validation set. Rows 6 and 7 still need to be explained. Rows 8 to 707 hold the wavelengths in nm.
- Clas-var: prediction of the 4 properties
- Classes: none already present
- Main question: Can we predict the fat % from the NIR measurements? The same for the other properties. The authors suggest excluding observation 23 in the calibration set: it appears as an outlier in most analyses, they suspect because of an error in the compositional (lab) data. Sample number 21 in the validation set shows up as an outlier in at least some analyses and might be considered for exclusion.
- The files: cookie-NIR-Fearn(transp).xls (Excel 2003 .XLS format), the cookie-readme.txt (2003 suggestions by P.J. Brown, T. Fearn, M. Vannucci) and the scatterplot
- Obtained: from a long search on the Internet to obtain the correct values and the authors' suggestions
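The row layout of the Cookie NIR file can be checked with a little arithmetic: a range of 1100-2498 nm in steps of 2 nm gives 700 wavelengths, which matches rows 8 to 707. The sketch below assumes 1-based row numbers and wavelengths stored in ascending order; the function name `row_of_wavelength` is hypothetical.

```python
# Wavelength axis described in the data set: 1100-2498 nm in steps of 2 nm.
wavelengths = list(range(1100, 2498 + 1, 2))

FIRST_SPECTRAL_ROW = 8  # rows 1-7 hold composition, set selector and other rows


def row_of_wavelength(nm):
    """Return the 1-based row that holds a given wavelength.

    Assumes ascending wavelengths starting at row 8.
    """
    if nm not in range(1100, 2498 + 1, 2):
        raise ValueError(f"{nm} nm is not on the 2 nm grid 1100-2498")
    return FIRST_SPECTRAL_ROW + (nm - 1100) // 2


print(len(wavelengths))         # prints 700
print(row_of_wavelength(1100))  # prints 8
print(row_of_wavelength(2498))  # prints 707
```

Because the matrix is transposed, most software will need it turned back to 40 rows (objects) by 707 columns before any regression is attempted.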
- centrifug127:
- Contents: 127 Objects, 4 Columns, 3 Variables
- Source: prof. G. Visco and 2007/08 students. Exercise on morphometrics and sampling designs, course Chemistry for Restoration and Chemometrics at "Magistralis degree course in Sciences Applied to the Cultural Heritage for Diagnostics and for their Conservation". Rome University, La Sapienza, Chemistry Department, Rome, Italy, Europe
- Description: Exhaustive sampling and simple random sampling with replacement on centrifuge test tubes, with a surprise! Although they are industrially produced, sterile and apparently all the same, when subjected to a set of non-destructive measurements they show a strange distribution. The measurements were made in the laboratory with a Toolmex Polmach, RS n. 182-9408 micrometer (0.01 mm resolution, certified) and a Gibertini analytical balance (resolution 0.0001 g, certified). Measured at Rome University, in the chemistry laboratory for Cultural Heritage of Prof. M.P. Sammartino, on 28 November 2007
- Variables: (randomly assigned tube number)=num, (see the photo, diameter, at about 12 mm from the bottom, mm)=diam-low, (see the photo, diameter, under the stopper, mm)=diam-up, (weight, without the stopper, g)=weight
- Clas-var: none
- Classes: none already present
- Main question: several questions: can we describe the population with only the 13 sampled tubes? Are there differences among the tubes? Can simple charts/graphs show the differences among the tubes? What is the best chemometric method to find and show clusters?
- The files: centrifug127.wks and centrifug127.csv, also the sampling: centrifug13.wks and centrifug13.csv
- Obtained: not obtained but directly measured by the teacher of the Chemometrics course (Dr. G. Visco) and the following students: M. Albini, C. Cacchione, M. De Paoli, L. Donato, D. Quarta. All wearing surgical gloves
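The simple random sampling with replacement mentioned in the description can be imitated in a few lines. This is only a sketch under stated assumptions: the tubes are taken to be numbered 1 to 127, the seed value is arbitrary, and the function name `draw_sample` is hypothetical; it does not reproduce the draw actually stored in centrifug13.csv.

```python
import random


def draw_sample(population, k, seed):
    """Simple random sampling WITH replacement: k draws from the population."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    return rng.choices(population, k=k)


tube_numbers = list(range(1, 128))  # assumed numbering of the 127 tubes
sample = draw_sample(tube_numbers, k=13, seed=2007)
print(sorted(sample))
```

With replacement the same tube may be drawn more than once, so the 13 draws can contain fewer than 13 distinct tubes; `random.Random.sample` would give sampling without replacement instead.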
- mixtures:
- Contents: 49 Observations, 4 Columns, 2 Variables
- Source: Padhraic Smyth, ICML '01 Keynote Talk, Williams College, MA, June 29th 2001
- Description: a mixture example for "Naïve Bayes": two components of a mixture, with 49 different values and two classes, "working as solvent" and "not working".
- Variables: (observation n.)=obs, (% values of solvent n.1)=Xvalues, (% values of solvent n.2)=Yvalues
- Clas-var: (a or b class)=Class
- Classes: 2, "working as solvent" and "not working"
- Main question: Find a boundary between the two classes. Not solvable with LDA. A solution is probably possible with SIMCA, with QDA, with LVQ, with ???. Simple and useful as an exercise for teaching chemometrics (suggestion: start with an XY scatterplot with/without colour), but with some difficulties in the positions of the objects.
- The files: mixtures.wks and mixtures.csv
- Obtained: from a .ppt of Padhraic Smyth's presentation, corrected with the addition of two missing values