Getting Started With Phantom



Phantom is a reactive, type-safe Scala driver for Apache Cassandra/DataStax Enterprise. So, first, let's explore what Apache Cassandra is with a basic introduction.

Apache Cassandra

Apache Cassandra is a free, open-source data storage system that was created at Facebook in 2008. It is a highly scalable database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database, which is schema-free. For more about Cassandra, refer to this blog: Getting Started With Cassandra.


We wanted to integrate Cassandra into the Scala ecosystem, which is why we use Phantom-DSL, one of the Outworkers modules. So, if you are planning on using Cassandra with Scala, phantom is the weapon of choice because of:

  • Ease of use and quality coding.
  • Reduction of code and boilerplate by at least 90%.
  • Automated schema generation.
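To give a flavour of the type-safe DSL, here is a hypothetical sketch of a table definition (assuming a recent phantom 2.x API; the `User` model and column names are made up for illustration):

```scala
import java.util.UUID
import com.outworkers.phantom.dsl._

case class User(id: UUID, name: String, email: String)

// Each column is declared as a typed object, so queries are checked at
// compile time and the CQL schema can be generated automatically.
abstract class Users extends Table[Users, User] {
  object id    extends UUIDColumn with PartitionKey
  object name  extends StringColumn
  object email extends StringColumn
}
```

Because the columns are ordinary typed Scala objects, a typo in a column name or a type mismatch in a query fails at compile time rather than at runtime against the cluster.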



Introduction to Perceptrons: Neural Networks

What is a Perceptron?

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.
A linear classifier requires that the training data can be separated into the corresponding categories, i.e. if we are classifying into 2 categories, then all the training data must lie in these two categories.
A binary classifier requires that there be exactly 2 categories for classification.
Hence, the basic perceptron algorithm is used for binary classification, and all the training examples should lie in these two categories. The perceptron is also the basic unit of a neural network.

Origin of Perceptron:-

The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, funded by the United States Office of Naval Research. The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the “Mark 1 perceptron“. This machine was designed for image recognition: it had an array of 400 photocells, randomly connected to the “neurons“. Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.


Components of a Perceptron:- Following are the major components of a perceptron:

    • Input:- All the features become the inputs to a perceptron. We denote the inputs of a perceptron by [x1, x2, x3, .., xn], where xi represents a feature value and n represents the total number of features. We also have a special kind of input called the BIAS. In the image, we have denoted the bias weight by w0.
    • Weights:- Weights are the values that are computed over the course of training the model. We start the weights with some initial value, and these values get updated for each training error. We represent the weights of a perceptron by [w1, w2, w3, .., wn].
    • BIAS:- A bias neuron allows a classifier to shift the decision boundary left or right. In algebraic terms, the bias neuron allows a classifier to translate its decision boundary, where to translate means to “move every point a constant distance in a specified direction”. Bias helps train the model faster and with better quality.
    • Weighted Summation:- The weighted summation is the sum of the values obtained after multiplying each weight [wi] with the associated feature value [xi]. We represent the weighted summation by ∑wixi for i in [1, n].
    • Step/Activation Function:- In general, the role of activation functions is to make neural networks non-linear; for the basic perceptron, which is a linear classifier, a simple step function is used.
    • Output:- The weighted summation is passed to the step/activation function, and whatever value we get after the computation is our predicted output.

Inside The Perceptron:-



  • Firstly, the features of an example are given as input to the perceptron.
  • These input features get multiplied by the corresponding weights [which start with some initial value].
  • The summation is computed over the values obtained by multiplying each feature with its corresponding weight.
  • The bias is added to the value of the summation. Then,
  • The step/activation function is applied to the resulting value.
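The steps above can be sketched in a few lines of plain Scala. This is a toy illustration only; the weights, bias, and inputs below are made up:

```scala
// A toy perceptron: weighted summation of inputs plus bias,
// followed by a step activation function.
case class Perceptron(weights: Seq[Double], bias: Double) {
  // Step/activation function: 1 if the net input is positive, else 0.
  private def step(z: Double): Int = if (z > 0) 1 else 0

  // Multiply each feature x_i by its weight w_i, sum, add the bias, activate.
  def predict(features: Seq[Double]): Int = {
    val weightedSum = weights.zip(features).map { case (w, x) => w * x }.sum
    step(weightedSum + bias)
  }
}

val p = Perceptron(weights = Seq(0.5, -0.5), bias = -0.2)
println(p.predict(Seq(1.0, 0.0))) // 0.5 - 0.2 = 0.3 > 0, so prints 1
println(p.predict(Seq(0.0, 1.0))) // -0.5 - 0.2 = -0.7, so prints 0
```

Training (updating the weights on each misclassified example) is not shown here; this sketch covers only the forward pass described in the steps above.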

References:- Perceptron: The Most Basic Form of Neural Network


Decision Tree using Apache Spark

What is Decision Tree?

A decision tree is a powerful method for classification and prediction, and for facilitating decision making in sequential decision problems. A decision tree is made up of two components:-
  • A decision
  • An outcome

and a decision tree includes three types of nodes:-

  • Root node: The top node of the tree comprising all the data.
  • Splitting node: A node that assigns data to a subgroup.
  • Terminal node:  Final decision (outcome).

To reach an outcome, the process starts at the root node; based on the decision made at the root node, the next node (a splitting node) is selected, and based on the decision made at that node, a child splitting node is selected. This process continues until we reach a terminal node, whose value is our outcome.
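This traversal can be sketched as a tiny Scala ADT. This is a hypothetical toy, unrelated to Spark's actual implementation; the feature names and outcomes are made up:

```scala
// A decision tree node is either a splitting node (holding a decision and
// two children) or a terminal node holding the final outcome.
sealed trait Node
case class Split(decision: Map[String, Double] => Boolean,
                 ifTrue: Node, ifFalse: Node) extends Node
case class Terminal(outcome: String) extends Node

// Start at the root and follow decisions until a terminal node is reached.
def classify(node: Node, features: Map[String, Double]): String = node match {
  case Terminal(outcome) => outcome
  case Split(decision, ifTrue, ifFalse) =>
    classify(if (decision(features)) ifTrue else ifFalse, features)
}

// Root splits on age, with a child split on income (made-up rules).
val tree: Node = Split(f => f("age") < 30,
  Split(f => f("income") > 50000, Terminal("approve"), Terminal("reject")),
  Terminal("approve"))

println(classify(tree, Map("age" -> 25.0, "income" -> 60000.0))) // approve
println(classify(tree, Map("age" -> 25.0, "income" -> 40000.0))) // reject
```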

Decision Tree in Apache Spark

It might sound strange or geeky, but there is no standalone implementation of the decision tree in Apache Spark. Technically, Apache Spark provides an implementation of the Random Forest algorithm in which the number of trees can be specified by the user, so under the hood Apache Spark trains a decision tree by calling Random Forest with one tree.

In Apache Spark, The decision tree is a greedy algorithm that performs a recursive binary partitioning of the feature space. The tree predicts the same label for each bottom most (leaf) partition. Each partition is chosen greedily by selecting the best split from a set of possible splits, in order to maximize the information gain at a tree node.

The node impurity is a measure of the homogeneity of the labels at the node. The current implementation provides two impurity measures for classification: Gini impurity and entropy.
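Both measures can be computed directly from the class proportions at a node. The following plain-Scala sketch (not Spark's internal code) shows the standard formulas, Gini = 1 − Σ pᵢ² and entropy = −Σ pᵢ log₂ pᵢ:

```scala
// Gini impurity: 1 - sum over classes of p_c^2.
// 0.0 means the node is pure; higher values mean more mixed labels.
def gini(labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  val ps = labels.groupBy(identity).values.map(_.size / n)
  1.0 - ps.map(p => p * p).sum
}

// Entropy: -sum over classes of p_c * log2(p_c), the other impurity
// measure available for classification.
def entropy(labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  val ps = labels.groupBy(identity).values.map(_.size / n)
  ps.map(p => -p * math.log(p) / math.log(2)).sum
}

println(gini(Seq(1, 1, 1, 1)))    // pure node: 0.0
println(gini(Seq(0, 1, 0, 1)))    // 50/50 split: 0.5
println(entropy(Seq(0, 1, 0, 1))) // 50/50 split: 1.0
```

The best split at a node is the one that most reduces the weighted impurity of the children relative to the parent, i.e. the one that maximizes information gain.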


Stopping rule

The recursive tree construction is stopped at a node when one of the following conditions is met:

  1. The node depth is equal to the maxDepth training parameter.
  2. No split candidate leads to an information gain greater than minInfoGain.
  3. No split candidate produces child nodes which each have at least minInstancesPerNode training instances.

Useful Parameters

  • algo:- Can be either Classification or Regression.
  • numClasses:- Number of classification classes.
  • maxDepth:- Maximum depth of the tree, in terms of nodes.
  • minInstancesPerNode:- For a node to be split further, each of its children must receive at least this number of training instances.
  • minInfoGain:- For a node to be split further, the split must improve the information gain by at least this much.
  • maxBins:- Number of bins used when discretizing continuous features.

Preparing training data for Decision Tree

You cannot directly feed arbitrary data to the decision tree; it demands a special input format. You can use the HashingTF technique to convert the training data into labelled data that the decision tree can understand. This process is also known as standardization of the data.
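As a sketch of that conversion (assuming the RDD-based MLlib API; `rawData` and the feature-vector size are made up for illustration), each text example is hashed into a fixed-size feature vector and wrapped in a `LabeledPoint`:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Hash each document's words into a fixed-size feature vector.
val tf = new HashingTF(numFeatures = 1000)

// rawData is assumed to be an RDD of (label, text) pairs.
val data = rawData.map { case (label, text) =>
  LabeledPoint(label, tf.transform(text.split(" ").toSeq))
}
```

The resulting `data` RDD of `LabeledPoint`s is what gets split and fed to `DecisionTree.trainClassifier` below.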

Feeding and obtaining Result

Once the data has been standardized, you can feed it to the decision tree algorithm for classification. But before that, you need to split the data into training and testing sets, i.e. to test the accuracy you need to hold back some part of the data for testing. You can split and feed the data like this:

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

Here, `data` is my standardized input data, which I split in the ratio 7:3 for training and testing purposes respectively. We are using `gini` impurity with a maximum depth of `5`.

Once the model is generated, you can try to predict classifications for other data, but before that we need to validate the accuracy of the recently generated model. You can validate or compute the accuracy by computing the `Test Error`.

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println("Test Error = " + testErr)

The lower the value of the Test Error, the better the model. You can take a look at a running example here.

References:- Apache Spark Documentation

Introduction to Machine Learning

Before jumping directly into what machine learning is, let's start with the meaning of the individual words, i.e. what a machine is and what learning is.

  • A machine is a tool containing one or more parts that transform energy. Machines are usually powered by chemical, thermal, or electrical means, and are often motorized.
  • Learning is the ability to improve behaviour based on Experience.

What is Machine Learning? :-

According to Tom Mitchell, machine learning is:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

  • Task T is what the machine is seeking to improve. It can be prediction, classification, clustering, etc.
  • Experience E can be the training data or input data through which the machine tries to learn.
  • Performance P can be some factor such as improved accuracy, or new skills that the machine was previously unaware of, etc.


Machine learning itself contains 2 main components: the Learner and the Reasoner.

  • Input/experience is given to the Learner, which learns some new skills from it.
  • Background knowledge can also be given to the Learner for better learning.
  • With the help of the input and the background knowledge, the Learner generates the Model.
  • The Model contains the information about what has been learnt from the input and experience.
  • Now, the problem/task is given to the Reasoner; it can be prediction, classification, etc.
  • With the help of the trained Model, the Reasoner tries to generate the Solution.
  • The solution/answer can be improved by adding additional input/experience and background knowledge.

How is Machine Learning different from a Standard Program? :-

In machine learning, you feed the computer the following things

  • Input [Experience]
  • Output [output corresponding to inputs]

And you get a Model/Program as output. With the help of this program, you can perform some tasks.

Whereas in a standard program, you feed the computer the following things

  • Input
  • Program [how to process the input]

And you get the output. A simple example can be to verify whether a number is prime or not.
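The standard-program side of that example can be sketched in Scala: the logic for processing the input is written by hand, rather than learnt from input/output examples.

```scala
// A standard program: we supply the input (n) and the program (the
// primality-testing logic), and the computer produces the output.
def isPrime(n: Int): Boolean =
  n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

println(isPrime(7)) // true
println(isPrime(9)) // false
```

In the machine-learning framing, we would instead supply many (number, is-prime) pairs and ask the learner to produce the program itself.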


REFERENCES:- Machine Learning by Tom Mitchell


Cassandra Internals :- Writing Process

What is Apache Cassandra?

Apache Cassandra is a massively scalable, open-source, non-relational database that offers continuous availability, linear-scale performance, operational simplicity and easy data distribution across multiple data centres and cloud availability zones. Cassandra was originally developed at Facebook; the main reason it was developed was to solve the inbox-search problem. To read more about Cassandra, you can refer to this blog.

Why you should go for Cassandra over a Relational Database:-

Relational Database                                     | Cassandra
--------------------------------------------------------|----------------------------------------------
Handles moderate incoming data velocity                 | Handles high incoming data velocity
Data arriving from one/few locations                    | Data arriving from many locations
Manages primarily structured data                       | Manages all types of data
Supports complex/nested transactions                    | Supports simple transactions
Single points of failure with failover                  | No single points of failure; constant uptime
Supports moderate data volumes                          | Supports very high data volumes
Centralized deployments                                 | Decentralized deployments
Data written in mostly one location                     | Data written in many locations
Supports read scalability (with consistency sacrifices) | Supports read and write scalability
Deployed in vertical scale-up fashion                   | Deployed in horizontal scale-out fashion

How the write Request works in Cassandra:-

  • The client sends a write request to a single, random Cassandra node. The node that receives the request acts as a proxy and writes the data to the cluster.
  • The cluster of nodes is stored as a “ring” of nodes, and writes are replicated to N nodes using a replication placement strategy.
  • With the RackAwareStrategy, Cassandra will determine the “distance” from the current node for reliability and availability purposes.
  • “Distance” is broken into three buckets: the same rack as the current node, the same data centre as the current node, or a different data centre.
  • You configure Cassandra to write data to N nodes for redundancy; it will write the first copy to the primary node for that data, the second copy to the next node in the ring in another data centre, and the rest of the copies to machines in the same data centre as the proxy.
  • This ensures that a single failure does not take down the entire cluster, and the cluster remains available even if an entire data centre goes offline.

In short, the write request goes from your client to a single random node, which sends the write to N different nodes according to the replication placement strategy. The node then waits for the N successes and returns success to the client.

Each of those N nodes gets that write request in the form of a “RowMutation” message. The node performs two actions for this message:

  • Append the mutation to the commit log for transactional purposes
  • Update an in-memory Memtable structure with the change

Cassandra does not update data in-place on disk, nor update indices, so there are no intensive synchronous disk operations to block the write.
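The two actions performed for each “RowMutation” can be modelled with a toy Scala sketch. This is purely illustrative; the real write path is far more involved (durability, flushing, compaction):

```scala
import scala.collection.mutable

// A toy model of the write path: every RowMutation is first appended to an
// append-only commit log, then applied to an in-memory Memtable.
case class RowMutation(key: String, column: String, value: String)

class ToyNode {
  val commitLog = mutable.ArrayBuffer.empty[RowMutation]          // append-only
  val memtable  = mutable.Map.empty[String, Map[String, String]]  // in-memory

  def write(m: RowMutation): Unit = {
    commitLog += m // 1. append the mutation to the commit log
    // 2. update the Memtable; no in-place disk update happens here
    val row = memtable.getOrElse(m.key, Map.empty)
    memtable(m.key) = row + (m.column -> m.value)
  }
}

val node = new ToyNode
node.write(RowMutation("user1", "name", "alice"))
node.write(RowMutation("user1", "city", "delhi"))
println(node.memtable("user1")) // Map(name -> alice, city -> delhi)
println(node.commitLog.size)    // 2
```

Note that both steps are sequential appends or in-memory updates, which is why there are no blocking random-disk operations on the write path.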

There are several asynchronous operations which occur regularly:

  • A “full” Memtable structure is written to a disk-based structure called an SSTable so we don’t get too much data in-memory only.
  • The set of temporary SSTables which exist for a given ColumnFamily is merged into one large SSTable. At this point, the temporary SSTables are old and can be garbage collected at some point in the future.

That’s how the Writing process works in Cassandra internally.

References:- A Brief Introduction to Apache Cassandra and Cassandra Internals

Unit Testing Of Kafka


Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another.

Generally, data is published to a topic via the Producer API, and the Consumer API consumes data from subscribed topics.

In this blog, we will see how to do unit testing of Kafka.

Unit testing your Kafka code is incredibly important; it's transporting your most important data. As of now, we have to explicitly run ZooKeeper and a Kafka server to test a producer and consumer.

Now there is also an alternative: testing Kafka without running ZooKeeper and a Kafka broker.

Thinking how? EmbeddedKafka is there for you.

EmbeddedKafka is a library that provides an in-memory Kafka broker to run your ScalaTest specs against. It uses Kafka and ZooKeeper 3.4.8.

It will start ZooKeeper and a Kafka broker for you.
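A minimal sketch of such a spec (assuming the `net.manub` scalatest-embedded-kafka library; the topic and message names are made up):

```scala
import net.manub.embeddedkafka.EmbeddedKafka
import org.scalatest.{FunSuite, Matchers}

class KafkaSpec extends FunSuite with Matchers with EmbeddedKafka {

  test("publish and consume a message without an external broker") {
    // Starts an in-memory ZooKeeper and Kafka broker for the duration
    // of the block, then shuts them down.
    withRunningKafka {
      publishStringMessageToKafka("test-topic", "hello")
      consumeFirstStringMessageFrom("test-topic") shouldBe "hello"
    }
  }
}
```

No external process is involved: the broker lives and dies with the test, so the spec is self-contained and safe to run in CI.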


Ingress to Monix-Execution: Scheduler


In the previous blog, you learnt about Monix Task. Now let's dive into the Monix Scheduler. Akka's Scheduler pulls in a pretty heavy dependency, whereas the Monix Scheduler provides a common API which is reusable.

For Monix Scheduler to work first we need to include the dependency which is:-

libraryDependencies += "io.monix" % "monix-execution_2.11" % "2.0-RC10"

Make sure that the Scala version suffix in your dependency (here `_2.11`) matches your project's Scala version. The Monix Scheduler can be a replacement for Scala's ExecutionContext, as:-
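As a sketch (assuming Monix 2.x; the delayed task below is made up), importing the global scheduler gives you an implicit `ExecutionContext` for `Future`s, plus scheduling capabilities that a plain `ExecutionContext` lacks:

```scala
import monix.execution.Scheduler.Implicits.global
import scala.concurrent.Future
import scala.concurrent.duration._

// The Monix scheduler works as a drop-in ExecutionContext for Futures.
Future(1 + 1).foreach(println)

// Unlike a plain ExecutionContext, it can also schedule delayed tasks,
// returning a Cancelable handle.
val cancelable = global.scheduleOnce(2.seconds) {
  println("runs after two seconds")
}
// cancelable.cancel() // call this to abort the task before it fires
```

The returned `Cancelable` is what makes the scheduler more than an execution context: scheduled work can be cancelled before it runs.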
