Data mining : WEKA


Hands-on Practice

- Weka
   This hands-on practice consists of Introduction, Installation, Preprocessing, Classification, Clustering, Visualization, Select attributes, and Association, all using the Weka tool.


What is Data mining?

   Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. [From Wikipedia]

- Alternative names
   Knowledge Discovery in Databases (KDD), Knowledge extraction, Data/pattern analysis, Data archeology, Data dredging, Information harvesting, etc.


Data mining Techniques

- Preprocessing
   Real-world data contains noise and missing values, which preprocessing techniques can remove or repair. Preprocessing is an essential step before applying data mining algorithms to the target data. Frequently used preprocessing techniques include normalization, standardization, and data cleaning.
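
   As a minimal sketch of these techniques, the Python snippet below fills missing values with the mean, applies min-max normalization, and applies z-score standardization. The function names and the sample data are assumptions made for illustration. They are not part of Weka, which provides its own preprocessing filters.

```python
from statistics import mean, pstdev

def fill_missing(values):
    """Data cleaning: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: zero mean, unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column, avoid division by zero
    return [(v - mu) / sigma for v in values]

# Hypothetical attribute column with one missing value.
raw = [2.0, None, 4.0, 6.0]
cleaned = fill_missing(raw)      # None is replaced by the mean of 2, 4, 6
scaled = normalize(cleaned)      # values now lie in [0, 1]
zscores = standardize(cleaned)   # values now have mean 0
```

   Mean imputation is only one of several ways to handle missing values. Dropping incomplete rows or using the median are common alternatives.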

- Feature Selection
   Feature selection is also known as attribute selection or variable selection. It selects the subset of features most relevant to the task and uses only those to construct models.
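
   One simple way to score relevance is to rank each feature by the absolute value of its Pearson correlation with the target and keep the top k. The sketch below assumes that approach. The feature names and data are hypothetical, and this is only one of many possible selection criteria, not necessarily what Weka's attribute evaluators do.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no information about the target
    return cov / (sx * sy)

def select_features(columns, target, k):
    """Return the names of the k features most correlated with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data set: three candidate features and a numeric target.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [9, 7, 8, 9, 7],
    "constant": [1, 1, 1, 1, 1],
}
target = [52, 58, 61, 70, 74]

selected = select_features(columns, target, k=1)
```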

- Classification
   Classification is a supervised pattern learning technique that uses labeled training patterns. It constructs rules for classifying new data into known groups. Well-known classifiers include k-NN, Naive Bayes, and decision trees.
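
   To make the idea concrete, here is a minimal sketch of a k-NN classifier in Python: a new point receives the majority label of its k nearest labeled training points. The training data, the labels "A" and "B", and the choice of Euclidean distance are assumptions made for this example.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest neighbors.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training patterns forming two well-separated groups.
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
    ((4.8, 5.1), "B"),
]

label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (5.1, 5.0))
```

   An odd k is a common choice for two-class problems because it avoids tied votes.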

- Clustering
   Clustering is an unsupervised pattern learning technique. It finds groups of objects such that the objects in a group are similar to one another and different from the objects in other groups. Common clustering techniques include k-means clustering and hierarchical clustering.
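
   The following is a minimal sketch of k-means clustering in Python. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. Using the first k points as the initial centroids is an assumption made to keep the example deterministic, where practical implementations usually pick the initial centroids randomly or with a smarter seeding scheme. The sample points are hypothetical.

```python
from math import dist

def assign(points, centroids):
    """Group each point with its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, max_iterations=100):
    """Return (centroids, clusters) after running k-means until convergence."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```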

- Regression
   Regression is a statistical process for estimating the relationships among variables. It is used to find the function that best fits the data points, that is, the one with the least error. There are two types of regression: linear and non-linear.
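
   For the linear case with a single input variable, the least-squares line has a closed-form solution. The sketch below computes it in Python, taking "least error" to mean the smallest sum of squared errors, which is the usual convention. The function name and data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx            # requires at least two distinct x values
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Use the fitted function to predict y for an unseen x.
prediction = slope * 5 + intercept
```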

- Association (Rule mining)
   Given a set of transactions, association rule mining finds rules that predict the occurrence of an item based on the occurrences of other items in the transaction. The Apriori algorithm is one of the best-known algorithms for rule mining.
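
   Rules are commonly scored with two standard measures: support (the fraction of transactions containing an itemset) and confidence (how often the consequent appears when the antecedent does). The Python sketch below computes both for a single rule. It is not an implementation of Apriori, which additionally searches for all frequent itemsets efficiently. The transactions and item names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# Score the rule {diapers} -> {beer}.
rule_support = support(transactions, {"diapers", "beer"})
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})
```

   In this sample, diapers and beer appear together in 3 of 5 transactions, and beer appears in 3 of the 4 transactions that contain diapers.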


Data mining Tools

- Weka
   Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.
Download link : http://www.cs.waikato.ac.nz/ml/weka/


- R
   R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
Download link : http://www.r-project.org/


- Etc.
   There are many other tools for data mining, such as RapidMiner, KNIME, Rattle, and so on.


References

  • Wikipedia, "Data mining"
  • Weka, http://www.cs.waikato.ac.nz/ml/weka/
  • R, http://www.r-project.org/
  • RDataMining.com, http://www.rdatamining.com/resources/tools/
  • "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2011-10-28.
  • Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "The Elements of Statistical Learning: Data Mining, Inference, and Prediction". Retrieved 2012-08-07.
  • Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 ed.). Elsevier. ISBN 978-0-12-374856-0.
  • Piatetsky-Shapiro, Gregory; Parker, Gary (2011). "Lesson: Data Mining, and Knowledge Discovery: An Introduction". Introduction to Data Mining. KD Nuggets. Retrieved 30 August 2012.
  • Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0-471-22852-4. OCLC 50055336
  • Oscar Marban, Gonzalo Mariscal and Javier Segovia (2009); A Data Mining & Knowledge Discovery Process Model. In Data Mining and Knowledge Discovery in Real Life Applications, Book edited by: Julio Ponce and Adem Karahoca, ISBN 978-3-902613-53-0, pp. 438-453, February 2009, I-Tech, Vienna, Austria.