28.07.2016.

Review of Free High-Quality Tools for "Big Data" Processing: Predictive Analytics Made Easy

Ivana Lukec, Ph.D.

The volume of available information has grown significantly over the last decade and has opened a totally new area of using data to help both science and operations. These days everybody is talking about “big data” concepts.

We looked into some of the data mining tools available for free and came to a conclusion: there is really a lot available, with a great selection and great quality!

In this article, we focus on the three tools that drew most of our attention. We will certainly be covering more!

RapidMiner

Among the available data mining tools, RapidMiner Studio was the first one that came to our attention. What attracted us was that it appeared to be open source and very popular. And that is true: it has an open source option for scientists, students and researchers. However, after installing and testing the trial version and seeing all its capabilities, it was hard to expect that all of this would be free of charge. Indeed, to use it fully, one must pay for a license after the two-week trial period.
What impressed us at first sight was the ease of every step, from installation to the first analysis of our own data set and the development of our first model. All steps were very simple, and we were able to have it up and running quickly.

The software has a large number of drag-and-drop operators that perform mathematical and graphical operations for you. The list of options for data analysis, modeling and prediction is really long, and all the calculation and analysis tools are easily picked from the menu, dragged and dropped into the calculation area and connected together. And boom, you're ready to go!

The approach is simply to assemble building blocks into a model.
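The building-block idea can be illustrated with a minimal Python sketch. This is not RapidMiner's API: the operator names (drop_missing, normalize), the connect helper and the sample readings are all hypothetical and serve only to show how small processing steps are wired together so that the output of one feeds the next.

```python
from functools import reduce

# Hypothetical "operators": each takes a list of rows and returns a new list.
def drop_missing(rows):
    """Remove rows that contain a missing (None) value."""
    return [row for row in rows if None not in row]

def normalize(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # Use a span of 1.0 for constant columns to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        tuple((value - low) / span for value, low, span in zip(row, lows, spans))
        for row in rows
    ]

def connect(*operators):
    """Wire operators together so the output of one feeds the next."""
    def process(rows):
        return reduce(lambda data, op: op(data), operators, rows)
    return process

# Made-up sample data: (temperature, pressure) readings, one with a gap.
readings = [(20.0, 1.0), (25.0, None), (30.0, 3.0), (40.0, 2.0)]

workflow = connect(drop_missing, normalize)
cleaned = workflow(readings)
print(cleaned)
```

Swapping, adding or removing an operator only changes the arguments passed to connect, which is roughly what dragging a block in or out of a visual workflow does.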

Coming from the field of industrial process monitoring and modeling, I looked straight away at the options for modeling and prediction. An impressive list of 125 modeling tools is included, such as different types of functions, regression, neural networks, correlation and parameter optimization: practically everything that someone working with data at an advanced level could need. This makes the software equally useful for engineering, operations, research and scientific studies. Different tools for data preprocessing, cleansing, analysis and visualization are also available. Developed models can easily be used in predictive applications and, as such, in a variety of industries.

The developers have also provided good-quality documentation, available both online and from within the software. This is software we would like to examine in more detail, with a more thorough analysis aimed at improving the process monitoring of a chemical plant.

Orange

In concept, Orange is similar to RapidMiner, but it is a fully open-source data visualization and analysis tool. It was developed by the Faculty of Computer and Information Science at the University of Ljubljana.
Data mining is done through visual programming or Python scripting. The tool has components for machine learning and includes add-ons for different data analytics features.
As in RapidMiner, the visual workflow is built by combining various widgets to develop the model.

The add-ons cover most standard data analysis tasks, such as data preprocessing, classification and different regression methods. Users can also develop their own widgets and integrate them to perform custom operations.

Once a model has been built from the processed data, it can be used to predict the outcome for any new data instance.
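What "using a model to predict the outcome for a data instance" means can be shown with a minimal sketch in plain Python. It deliberately does not use Orange's own scripting interface: fit_line and predict are hypothetical helper names, the model is an ordinary least-squares line, and the training data is made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) model to one new data instance."""
    slope, intercept = model
    return slope * x + intercept

# Made-up training data that follows y = 2x + 1 exactly.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_x, train_y)
print(predict(model, 10.0))  # prints 21.0
```

The same two-step pattern, fitting on known data and then applying the fitted model to an unseen instance, is what the visual tools perform behind their modeling and prediction blocks.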

Orange also lets you focus on exploratory data analysis instead of coding, while clever defaults make fast prototyping of a data analysis workflow extremely easy. Place widgets on the canvas, connect them, load your datasets and harvest the insight!

Compared to RapidMiner, the list of calculation options is a bit shorter. However, considering that the list includes all the important algorithms, that users can define their own, and how easy and user-friendly the software is to work with, it is a really worthy tool that we can recommend. Orange also has an impressive list of users and recommendations.

Weka

Weka is a product of the Computer Science Department at the University of Waikato in New Zealand. It is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka also contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is open source software issued under the GNU General Public License and is very popular among students.

It handles a variety of regression, clustering and classification problems, as well as many statistical analyses that help in discovering and understanding patterns in large data sets. It contains several familiar neural networks as well as others developed specifically for the program. To make it even more useful, it offers extensive graphics capabilities, so that one can visualize patterns and results and use them to suggest further lines of analysis.

Because Weka requires Java to run, its website states which Java version is required. Depending on your computing platform, you may have to download and install Java separately.

Weka runs on Windows, Linux and a variety of other platforms. We used the Windows installer weka-3-8-0jre.exe (100.8 MB).

The Explorer, where data may be clustered, classified, associated and visualized with a number of algorithms, consists of menu-driven commands. In this area, the user can import and save data sets, apply supervised and unsupervised filters, run a wide variety of algorithms, cluster the data, train and test on subsets of the data, and invoke a variety of plot types to visualize data and results.
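Training and testing on subsets of the data can be sketched in a few lines of plain Python. This is a conceptual illustration, not Weka's API: train_test_split, nearest_neighbour and accuracy are hypothetical helpers, the classifier is a simple 1-nearest-neighbour rule chosen for brevity, and the data set is made up.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into training and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def nearest_neighbour(train, features):
    """Predict the label of the closest training row (1-nearest-neighbour)."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train, key=squared_distance)[1]

def accuracy(train, test):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(nearest_neighbour(train, features) == label
               for features, label in test)
    return hits / len(test)

# Made-up data set: two well-separated groups labelled "low" and "high".
data = [((float(x), x + 1.0), "low") for x in range(10)] + \
       [((x + 100.0, x + 101.0), "high") for x in range(10)]

train, test = train_test_split(data)
print(len(train), len(test), accuracy(train, test))
```

Holding back a test subset that the model never sees during training gives an honest estimate of how well it will classify new data.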

We noticed that the process of data selection and formatting is not as easy as with the first two tools. Still, Weka is very popular among students and has a large supporting community.

We will certainly extend this analysis with tests on our own set of real data.

Ivana Lukec, Ph.D. in chemical engineering, specialized in the field of mathematical modeling and simulation in the process industry.