Quick & easy data exploration with RapidMiner

Here at Data Detectives we use RapidMiner for a majority of our data science applications.

In this post you will see how we load data into RapidMiner from a local machine and then do some quick data exploration.

In this example we will be using the classic Titanic data set.

Loading data into RapidMiner

Above you can see how we use the Retrieve operator in RapidMiner to select the Titanic data set from our local machine.

*Note we can also retrieve data from a remote database, given the appropriate drivers and administrative access.
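For readers who want to try the same loading step outside RapidMiner, here is a minimal sketch in plain Python. It is not what RapidMiner does internally. The column names and the three sample rows are made up for illustration, and the in-memory text stands in for a local file such as "titanic.csv" (a hypothetical file name).

```python
import csv
import io

# Made-up sample standing in for a local CSV file; the column names
# and values are assumptions, not the real Titanic data.
SAMPLE_CSV = """passenger_class,sex,age,survived
First,Female,29,Yes
Third,Male,,No
Second,Male,40,No
"""


def load_rows(handle):
    """Read CSV text from an open file-like object into a list of dicts."""
    return list(csv.DictReader(handle))


# For a real file you would pass open("titanic.csv", newline="") instead.
rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows loaded")
print(rows[0])
```

Every value comes back as a string, and a missing value shows up as an empty string, so numeric columns such as age need converting before any calculation.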

Exploring the data from 'Results' view

After you run the process, RapidMiner takes you to the 'results' view, where you can inspect the data set.

In the image above you can see the number of rows and columns in the data set. In this case there are 1,309 rows and 12 columns. In the 'data' tab in 'results' view you can see the names of the columns and the different values in each row.

In the 'statistics' tab of the 'results' view in RapidMiner you can find information about your data set, such as the type of values in each column, whether there are any missing values, the maximum and minimum values, and a visual distribution of each column.

*Note the statistics tab reveals there are 263 missing values in the 'age' column.
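The kind of summary the 'statistics' tab gives can be roughly approximated in a few lines of Python. This is only a sketch under assumptions: the four rows and the column names are invented for illustration, and an empty string is treated as a missing value.

```python
import statistics

# Made-up rows; an empty string marks a missing value (an assumption).
rows = [
    {"sex": "Female", "age": "29"},
    {"sex": "Male", "age": ""},
    {"sex": "Male", "age": "40"},
    {"sex": "Female", "age": "2"},
]


def summarize(rows, column):
    """Return missing count, min, max and mean for a numeric column."""
    values = [row[column] for row in rows]
    present = [float(v) for v in values if v != ""]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
    }


print(len(rows), "rows,", len(rows[0]), "columns")
summary = summarize(rows, "age")
print(summary)
```

Run against the full data set, the "missing" entry is where a figure like the 263 missing ages would show up.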

Checking distribution of various attributes from 'visualizations' tab

Selecting 'open visualization' from the 'statistics' tab in the 'results' view of RapidMiner takes you to the 'visualizations' tab, where you can see the distribution of each column and explore the data further through visualizations.

Some highlights from the visualizations above:

  • In the first visualization you will notice there were a lot more passengers in third class than in any other class on the ship.

  • In the second visualization you will notice there were almost twice as many male passengers as female passengers.

  • In the third visualization you will notice the distribution of age among the passengers on the ship.

  • In the last visualization you will notice most passengers boarded the ship at the port of Southampton.

*Note in the visualization above you can see that, sadly, more passengers died than survived the maiden voyage of the Titanic.
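The single-attribute distributions described above boil down to counting how often each value occurs. Here is a rough sketch of that idea with a text bar chart. The rows and column names are invented for illustration and do not reflect the real proportions in the data set.

```python
from collections import Counter

# Made-up rows; column names and values are assumptions.
rows = [
    {"passenger_class": "Third", "sex": "Male"},
    {"passenger_class": "Third", "sex": "Male"},
    {"passenger_class": "Third", "sex": "Female"},
    {"passenger_class": "First", "sex": "Female"},
    {"passenger_class": "Second", "sex": "Male"},
]


def distribution(rows, column):
    """Count how many rows carry each distinct value of a column."""
    return Counter(row[column] for row in rows)


class_counts = distribution(rows, "passenger_class")

# Print one bar per value, most frequent first.
for value, count in class_counts.most_common():
    print(f"{value:<8}{'#' * count} ({count})")
```

The same function works for any categorical column, so swapping "passenger_class" for "sex" gives the second chart's counts.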

Combining attributes in 'visualizations' tab

From the 'visualizations' tab in RapidMiner you can easily combine attributes for further data exploration.

Some key takeaways from the visualizations above include:

  • In the first visualization you will notice there were a lot more female survivors than male survivors on the Titanic.

  • In the second visualization you can see there were very few survivors from third class cabins.

  • The last visualization shows that there were a lot more male passengers than female passengers in third class.
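Combining two attributes, as in the charts above, amounts to counting pairs of values. Here is a small sketch of such a cross-tabulation. The six rows and the column names are made up for illustration, so the counts say nothing about the real data set.

```python
from collections import Counter

# Made-up rows; column names and values are assumptions.
rows = [
    {"sex": "Female", "passenger_class": "First", "survived": "Yes"},
    {"sex": "Female", "passenger_class": "Third", "survived": "Yes"},
    {"sex": "Female", "passenger_class": "Second", "survived": "No"},
    {"sex": "Male", "passenger_class": "Third", "survived": "No"},
    {"sex": "Male", "passenger_class": "Third", "survived": "No"},
    {"sex": "Male", "passenger_class": "First", "survived": "Yes"},
]


def cross_tab(rows, first, second):
    """Count rows for every (first, second) value combination."""
    return Counter((row[first], row[second]) for row in rows)


table = cross_tab(rows, "sex", "survived")
for (sex, survived), count in sorted(table.items()):
    print(f"sex={sex:<7} survived={survived:<4} count={count}")
```

Calling cross_tab(rows, "passenger_class", "survived") or cross_tab(rows, "passenger_class", "sex") gives the pair counts behind the other two charts.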

So, this was a quick example of data exploration using RapidMiner. This whole process of data loading and exploration can be done quickly and easily using RapidMiner Studio.
