Scala Map


Scala Map is a collection of key-value pairs. A map cannot have duplicate keys, but different keys can have the same value; that is, keys are unique whereas values can be duplicated.

Maps in Scala are not language syntax. They are library abstractions that you can extend and adapt.

Scala provides mutable and immutable alternatives for maps. The class hierarchy for Scala maps is shown below:

[Class hierarchy diagram for Scala maps; image taken from Programming in Scala by Martin Odersky]

There’s a base Map trait in package scala.collection, and two subtraits: a mutable Map in scala.collection.mutable and an immutable one in scala.collection.immutable. By default, Scala Map is immutable, but if you want to use the mutable one you need to import it with the statement: import scala.collection.mutable
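A minimal sketch of these points (unique keys, duplicate values, and the immutable/mutable split); the maps and names here are made-up examples:

```scala
// Immutable Map (the default): adding an entry returns a NEW map,
// the original is unchanged.
val capitals = Map("India" -> "New Delhi", "Japan" -> "Tokyo")
val more = capitals + ("France" -> "Paris")   // capitals still has 2 entries

// Duplicate keys collapse: the last binding for a key wins.
val dup = Map("a" -> 1, "a" -> 2)             // Map(a -> 2)

// Different keys may share the same value.
val sameValues = Map("x" -> 0, "y" -> 0)

// Mutable Map: updates happen in place after importing scala.collection.mutable.
import scala.collection.mutable
val scores = mutable.Map("alice" -> 10)
scores("bob") = 7                              // add a new entry in place
scores("alice") += 5                           // update an existing entry in place

println(more("France"))
println(dup("a"))
println(scores)
```

Note that the immutable `+` never touches `capitals`, whereas `scores("bob") = 7` mutates `scores` directly; this is the practical difference between the two trait hierarchies.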

View original post 1,282 more words

Starting HiveServer2 Programmatically


HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results (a more detailed intro here). The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

In this blog we will learn how to use HiveServer2 with Java. Do not confuse yourself: there is no requirement of Hive on your machine. We are creating a standalone HiveServer2 that will run in embedded mode, so let’s get started.

For this purpose I am using sbt. First, create an sbt project with IntelliJ, then add the following dependencies in build.sbt:

libraryDependencies += "org.apache.hive" % "hive-exec" % "1.2.1" excludeAll ExclusionRule(organization = "org.pentaho")
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"
libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.3.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" %…
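As a sketch of how embedded mode can be reached, assuming the dependencies above plus the Hive JDBC driver are on the classpath: the Hive JDBC driver treats a connection URL with no host or port (jdbc:hive2://) as a request to start an embedded HiveServer2 inside the current JVM. The table name below is a made-up example.

```scala
// Sketch only: requires hive-jdbc and its dependencies on the classpath.
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")

// An empty host:port in the URL selects embedded mode:
// HiveServer2 runs inside this JVM, no external Hive needed.
val conn = DriverManager.getConnection("jdbc:hive2://", "", "")
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE IF NOT EXISTS demo (id INT, name STRING)") // "demo" is a made-up table
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()
```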

View original post 798 more words

Smattering of HDFS


Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers. HDFS, the Hadoop Distributed File System, has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant, as it provides high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of big data and supporting big data analytics applications. It is the primary storage system used by Hadoop applications.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS uses a master/slave architecture, where the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
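The master/slave split above can be sketched with a toy model (illustrative Scala only, not real HDFS APIs): the NameNode holds only metadata mapping a file path to block locations, while the DataNodes hold the actual bytes. All class names, paths, and block ids here are made up.

```scala
// Toy model of the HDFS master/slave architecture (illustrative only).

// A DataNode (slave) stores actual block data.
class DataNode(val id: String) {
  private val blocks = scala.collection.mutable.Map.empty[Long, Array[Byte]]
  def store(blockId: Long, data: Array[Byte]): Unit = blocks(blockId) = data
  def read(blockId: Long): Array[Byte] = blocks(blockId)
}

// The NameNode (master) stores only metadata: which blocks make up a file,
// and which DataNode holds each block.
class NameNode {
  private val fileToBlocks = scala.collection.mutable.Map.empty[String, Seq[(Long, DataNode)]]
  def addFile(path: String, placement: Seq[(Long, DataNode)]): Unit =
    fileToBlocks(path) = placement
  def locate(path: String): Seq[(Long, DataNode)] = fileToBlocks(path)
}

val nn  = new NameNode
val dn1 = new DataNode("dn1")
val dn2 = new DataNode("dn2")

// "Write" a file split into two blocks placed on different DataNodes.
dn1.store(1L, "hello ".getBytes)
dn2.store(2L, "hdfs".getBytes)
nn.addFile("/user/demo.txt", Seq((1L, dn1), (2L, dn2)))

// A client first asks the NameNode for locations, then reads from the DataNodes.
val content = nn.locate("/user/demo.txt")
  .map { case (blockId, dn) => new String(dn.read(blockId)) }
  .mkString
println(content)
```

The key property the sketch shows is that file contents never pass through the NameNode; it only answers the "where" question, which is why a single master can coordinate many DataNodes.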

What are NameNodes and DataNodes?

The NameNode

View original post 312 more words

Getting Started with Apache Cassandra


Why Apache Cassandra?

Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems.
Cassandra has become so popular because of its outstanding technical features. It is durable, seamlessly scalable, and tuneably consistent.
It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s no single point of failure.
It is highly available and offers a schema-free data model.


Cassandra is available for download from the Web here. Just click the link on the home page to download the latest release version, and unzip the downloaded Cassandra archive to a local directory.

Starting the Server:

View original post 221 more words

Getting Started with Apache Solr

Apache Solr is an open source search server. It includes the full-text search engine library Apache Lucene, so Solr is essentially an HTTP wrapper around the inverted index provided by Lucene. The purpose of an inverted index is to allow fast full-text search, at the cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, deployed on a large scale, for example in search engines.
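The trade-off described above can be sketched in a few lines of Scala: indexing does the heavy per-document work (tokenizing, updating postings lists), so each query reduces to cheap set operations. The documents here are made-up examples.

```scala
// A minimal inverted index: term -> set of document ids (the postings list).
val docs = Map(
  1 -> "solr is a search server",
  2 -> "lucene provides the inverted index",
  3 -> "solr wraps lucene with http"
)

// Build phase: the expensive part, done once per added document.
val index: Map[String, Set[Int]] =
  docs.toSeq
    .flatMap { case (id, text) => text.split("\\s+").map(term => (term, id)) }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

// Query phase: an AND query is just an intersection of postings lists.
def search(terms: String*): Set[Int] =
  terms.map(t => index.getOrElse(t, Set.empty[Int])).reduce(_ intersect _)

println(search("solr"))
println(search("solr", "lucene"))
```

Lucene's real index adds term frequencies, positions, and compressed on-disk postings, but the lookup-by-term shape is the same.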

Now that you have some idea of what Apache Solr does, let’s download it and start working with it. You can download the latest version from here.

It’s easy to install and start Apache Solr. Just follow these steps and we are good to go:

  1. Download Apache Solr from here.
  2. Extract it to the desired location.
  3. Change directory to the Apache Solr folder.
  4. Type ./bin/solr start -e cloud -noprompt
  5. To stop Apache Solr, type ./bin/solr stop -all

Continue reading “Getting Started with Apache Solr”

Knoldus Bags the Prestigious Huawei Partner of the Year Award


Knoldus was humbled to receive the prestigious partner of the year award from Huawei at a recently held ceremony in Bangalore, India.


It means a lot to us and is a validation of the quality and focus that we put into the Scala and Spark ecosystem. Huawei recognized Knoldus for its expertise in Scala and Spark, along with the excellent software development process practices under the Knolway™ umbrella. Knolway™ is the Knoldus Way of developing software, which we have curated and refined over the past 6 years of developing Reactive and Big Data products.

Our heartiest thanks to Mr. V.Gupta, Mr. Vadiraj and Mr. Raghunandan for this honor.


About Huawei

Huawei is a leading global information and communications technology (ICT) solutions provider. Driven by responsible operations, ongoing innovation, and open collaboration, we have established a competitive ICT portfolio of end-to-end solutions in telecom and enterprise networks, devices, and cloud computing…

View original post 170 more words

Application compatibility for different Spark versions


Recently Spark version 2.1 was released, and there is a significant difference between the two versions.

Spark 1.6 uses DataFrame and SparkContext as its primary abstractions, while 2.1 uses Dataset and SparkSession.

Now the question arises: how do we write code so that both versions of Spark are supported?

Fortunately, Maven provides the feature of building your application with different profiles.
In this blog I will tell you how to make your application compatible with different Spark versions.

Let’s start by creating an empty Maven project.
You can use the maven-archetype-quickstart archetype for setting up your project.
Archetypes provide a basic template for your project, and Maven has a rich collection of these templates for all your needs.

Once the project setup is done, we need to create three modules. Let’s name them core, spark, and spark2, setting the artifactId of each module to its respective name.
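A sketch of what the profiles section of the parent pom.xml could look like for this layout; the profile ids, version numbers, and the spark.version property name are assumptions, while the module names follow the text:

```xml
<!-- Illustrative sketch: one Maven profile per Spark version.
     Each profile pins a Spark version and picks the matching module. -->
<profiles>
  <profile>
    <id>spark-1.6</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
    <modules>
      <module>core</module>
      <module>spark</module>
    </modules>
  </profile>
  <profile>
    <id>spark-2.1</id>
    <properties>
      <spark.version>2.1.0</spark.version>
    </properties>
    <modules>
      <module>core</module>
      <module>spark2</module>
    </modules>
  </profile>
</profiles>
```

With this layout, running mvn package -P spark-2.1 would build the core and spark2 modules against Spark 2.1.0, while the default build targets Spark 1.6.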

View original post 557 more words