Elasticsearch Complete Tutorial

Elasticsearch, Logstash, Kibana simple guide for beginners

Elasticsearch is document oriented, meaning that it stores entire objects or documents.

ELK has three components:

  1. Elasticsearch
  2. Logstash
  3. Kibana

Within a search engine, mapping defines how a document is indexed and how its fields are stored.

Mapping is comparable to a database schema: it describes the fields and properties that documents hold, the datatype of each field (string, integer, or date), and how those fields should be indexed and stored by Lucene.

Apache Lucene is the search library at the core of Elasticsearch; Elasticsearch is built on top of it.

What is an inverted index?

In the example below, “Hi” is present in both document 1 and document 2, and the same goes for the other shared words.

Document 1: Hi! Welcome to tutorial
Document 2: Hi! Welcome to traveldiaries4u

If I store the above documents in an inverted index, each term is mapped to the documents that contain it, so that a search can look the term up directly:

Hi: 1,2
Welcome: 1,2
To: 1,2
Tutorial: 1
traveldiaries4u: 2
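The term-to-documents mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene implements it: the function name build_inverted_index is made up, and the "analysis" step (lowercasing and keeping runs of letters and digits) is a simplifying assumption.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Toy analysis step: lowercase, then keep runs of letters and digits.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


documents = {
    1: "Hi! Welcome to tutorial",
    2: "Hi! Welcome to traveldiaries4u",
}
inverted_index = build_inverted_index(documents)
for term, doc_ids in inverted_index.items():
    print(term, doc_ids)
```

Looking up a word is then a single dictionary access, for example inverted_index["tutorial"], instead of a scan over every document.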

What are Indices, Index and Documents?

An index is a collection of documents. You can add many documents to an index.

A search against an index covers all the documents in it, across all of its “types”.

Both “indexes” and “indices” are plural forms of the word “index”.

Each document has an ID that identifies it. You can assign a unique ID yourself; if you do not specify one, Elasticsearch generates it for you.

Documents are JSON objects that correspond to rows in a relational database.

Documents are the things that you are searching for.

Each document is stored in an index and has a type and an id.

The query below matches every document in every index, which is a quick way to check what has been indexed.

GET _search
{
  "query": { "match_all": {} }
}

To list all the indices, use GET /_cat/indices?v.

What are Types?

A type represents the kind of document. It is also called a document type.

A type consists of a name and a mapping (explained below) and is recorded in the _type field.

This field is used for filtering when querying a specific type.

E.g. a twitter index could have a mapping of type "user" for storing all users, and a mapping of type "tweet" to store all tweets. Both of these types still belong to the same index, so you could search inside multiple types in the same index.

The Elasticsearch documentation states that types are deprecated and are being removed, for several reasons. Newer versions of Elasticsearch therefore allow only one mapping type per index, i.e. you can have either user or tweet inside the twitter index, but not both.

The documentation further recommends being consistent and using _doc as the name of the mapping type.

In the query below I have specified “_doc” as the type.

PUT /users/_doc/1
{
  "name": "ram"
}

Note: you need to be running at least Elasticsearch 6.2.0 to be able to specify “_doc” as the document “type”.

Mapping

Mapping is like a schema definition in a relational database.

If mapping is not defined, then it will be generated automatically when a document is indexed.

A mapping defines the structure of documents, i.e. which fields they can contain, along with their field types. Field types are just data types, such as strings and numbers.

If you are confused about “mapping” and “types”, read https://www.elastic.co/blog/found-elasticsearch-mapping-introduction for details and then come back.

What is _source?

The JSON document that you index is stored in the _source field by default and is returned by all get and search requests.

Sharding

Sharding divides an index into smaller pieces of data called shards. These shards can be distributed across different nodes.

Suppose you have an index of 1 GB but each node has only 500 MB of disk space. You can divide the index into two shards and distribute them across two nodes. This is called sharding.

Sharding splits the data so that its volume can scale beyond a single node.
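To make the idea of splitting concrete, here is a minimal Python sketch of how a document could be assigned to a shard: hash a routing value (by default the document ID) and take the remainder modulo the number of primary shards. The function name pick_shard is made up, and crc32 is only a stand-in, since Elasticsearch uses its own hash function internally.

```python
import zlib


def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Return the shard number for a routing value (by default the document ID)."""
    # crc32 is a stand-in here; it is NOT the hash that Elasticsearch uses.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


for doc_id in ["1", "2", "3", "4"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 2))
```

Because the result depends on the number of primary shards, the same document would land on a different shard if that number changed, which is one reason the shard count is normally chosen when the index is created.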

Elasticsearch vs RDBMS

Elasticsearch    RDBMS
Index            Database
Shard            Shard
Mapping          Schema definition
Field            Column
JSON Object      Tuple
Document         Row
Type             Table

So far we have seen the basics. Now let us get hands-on.

Open Kibana and locate “Dev Tools”.

Below is the command structure to follow when performing an action.

<REST verb> /<Index>/<Type>/<API>

For example,

GET /myindex/mytype/_search

You can run the same request from the command line with curl instead of Dev Tools. Use whichever you prefer.

curl -XGET http://localhost:9200/myindex/mytype/_search

Create an Index

We will first create an index named "product" and use the pretty parameter to get the response in a human-readable form.

PUT /product?pretty

Create a type and a document for the index

Run this command to add a document.

POST /product/_doc
{
  "name": "ram",
  "address": {
    "firstName": "ram",
    "lastName": "ji"
  }
}

Note the _id field in the response: it is a unique ID that was randomly generated for the document. You can specify your own too.

The “_doc” in the response is the “type”. It is also called the document type, not to be confused with a field type.

Types are explained in the section on types above.

You can set the document ID yourself by putting it in the request path, for example PUT /product/_doc/1 with the same body.

You can then fetch the document by ID with a GET request, for example GET /product/_doc/1.

The _source field in the response contains the original JSON document body that was passed at index time.

The _source field itself is not indexed and thus is not searchable, but it is stored so that it can be returned by fetch requests such as get or search.

You can filter the search results by using the query below.

The query below returns the documents whose name field matches the word “ram”.

GET /product/_search
{
  "query": {
    "match": { "name": "ram" }
  }
}

In the above query, “name” is the name of a field; you can substitute any other field name.

How to list all field names?

The query below returns the mapping of an index, which lists all of its field names.

GET index_name/_mapping?pretty

Document update and versions

In the same request we used above, I add a new field called “city” to the fields we defined previously.

Once I run this, you can see the version change to 2. The version number increases each time the document changes.

Document Update

You can use the update command below so that you send only the changes instead of retyping the entire document.

This is done by calling the “_update” API and putting the changed keys inside a “doc” property, as seen below.

We can also add an array of strings in a “tags” field, as seen below. "tags" is just a name; you can rename it.

POST /product/_doc/1/_update
{
  "doc": {
    "address": {
      "city": "theni",
      "tags": [ "Elasticsearch" ]
    }
  }
}

Updating using scripts

We can use scripted updates to update the document.

The script accesses the document through a variable named “ctx”.

ctx is a special variable that exposes meta fields such as “_id”, “_source” and “_version”, and so gives you access to the source of the document that you want to update.

We will use the “_source” field to change the values, as seen below.

POST /product/_doc/1/_update
{
  "script": "ctx._source.address.city = 'madurai'"
}

Delete the document

Use the below command to delete the document

DELETE /product/_doc/1

Use the below command to delete the index

DELETE /product

Bulk method

You can index multiple documents in a single request using the _bulk API.

Below I have indexed two documents. Once I run it, you can see the items listed in the response.
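The body of a _bulk request is newline-delimited JSON: one action line followed by one source line per document, with a newline at the very end. The Python sketch below builds such a body as a string. The helper name build_bulk_body and the sample documents are made up for illustration, and the action line assumes the typeless form; on 6.x the action line would also carry a "_type".

```python
import json


def build_bulk_body(index_name, documents):
    """Return an NDJSON string: one action line plus one source line per document."""
    lines = []
    for doc_id, source in documents.items():
        # Action line: says what to do and where.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(
    "product",
    {"1": {"name": "ram", "price": 100}, "2": {"name": "divya", "price": 200}},
)
print(bulk_body)
```

The printed text can be pasted under POST /_bulk in Dev Tools. Each JSON object has to stay on a single line, which is why the body is not pretty-printed.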

Difference between Upsert and Update

Update changes existing documents only. Upsert changes an existing document, and inserts a new document if it does not find a match.

In the query below, after deleting the document, if you run the POST and then a GET, you will find that the price is 100.

If you now run the POST again and then the GET again, 10 will have been added to the price.
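The behaviour described above can be mimicked with a plain Python dictionary standing in for the index. This is only a sketch of the semantics, not a call to Elasticsearch: the function scripted_upsert and its parameters are made up, and the values 100 and 10 are taken from the description above.

```python
def scripted_upsert(store, doc_id, increment, upsert_doc):
    """Apply an update with an upsert fallback to an in-memory dict of documents."""
    if doc_id not in store:
        # No existing document: insert the upsert document as-is.
        store[doc_id] = dict(upsert_doc)
    else:
        # Existing document: run the "script" (here, add to the price).
        store[doc_id]["price"] += increment
    return store[doc_id]


store = {}
first = dict(scripted_upsert(store, "1", 10, {"price": 100}))
second = dict(scripted_upsert(store, "1", 10, {"price": 100}))
print(first, second)
```

The first call finds nothing and inserts the upsert document, so the price is 100. The second call finds the document and applies the increment instead, giving 110.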

Health of cluster

You can explore the health of the cluster by using the commands below:

GET /_cat/health?v

GET /_cat/nodes?v

GET /_cat/indices?v

How to add mapping to the indices?

By default, Elasticsearch applies dynamic mapping to the fields if you don’t specify a mapping explicitly.

With the command below we define the mapping explicitly, and we can see that the data type of the “mark” field is double.

The command below lists all mappings.

GET /product/_doc/_mapping

Note: you cannot change the type of a field in an existing mapping; to do so you need to delete the index and create it again.

When I tried to change the "mark" field type to "integer", the request resulted in an error.

Elasticsearch Aggregation

Aggregations provide aggregated data over the search results.

Below is the Shakespeare data set, for which I have created the index Shakespeare and loaded its fields by using the “bulk” API.

In the query below, we are running a sample aggregation to get the list of unique “play_count”.

In the bucket-level aggregation, as you can see, we have listed out the “play_name” values.

The second one is the Cardinality Aggregation.

This aggregation gives the count of distinct values of a particular field.

As you can see, the distinct count of the field “play_name” is 3 in the result.
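What the two aggregations compute can be sketched in plain Python. The sample rows and play names below are made up for illustration (they are not taken from the data set), and the function names are hypothetical. Note also that this distinct count is exact, whereas the cardinality aggregation in Elasticsearch returns an approximate count.

```python
from collections import Counter

# Made-up sample rows; only the field name "play_name" comes from the text above.
rows = [
    {"play_name": "Hamlet"},
    {"play_name": "Hamlet"},
    {"play_name": "Hamlet"},
    {"play_name": "Macbeth"},
    {"play_name": "Macbeth"},
    {"play_name": "Othello"},
]


def terms_aggregation(docs, field):
    """Bucket documents by field value and count the documents in each bucket."""
    counts = Counter(doc[field] for doc in docs)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


def cardinality_aggregation(docs, field):
    """Count the distinct values of a field."""
    return len({doc[field] for doc in docs})


buckets = terms_aggregation(rows, "play_name")
distinct_plays = cardinality_aggregation(rows, "play_name")
print(buckets)
print(distinct_plays)
```

The bucket list has one entry per distinct play name with its document count, while the cardinality result is just the number of buckets.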

Elasticsearch Analyzers

Elasticsearch has built-in analyzers, which can be used in any index.

Analysis is done by analyzers and involves the following steps:

  1. Split the piece of text into individual terms or tokens.
  2. Standardize the individual terms so they become more searchable.

(The whole point is to make the text in the index searchable.) The example below uses the standard analyzer, which is a built-in analyzer.

An analyzer is a combination of the following three components:

  1. Tokenizers
  2. Token filters
  3. Character filters

Tokenizers

A tokenizer receives a stream of characters, breaks it up into individual tokens(usually individual words), and outputs a stream of tokens.

For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms like [Quick, brown, fox!].

You can see that the text is broken down into tokens by the tokenizer. In our case we are using the standard tokenizer.

Token filters

A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase.

Below is an example of the lowercase token filter.

Character filters

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.

For instance, it can strip HTML elements like the <b> tag from the stream using the html_strip character filter.
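The three components run in a fixed order: character filters first, then the tokenizer, then token filters. The Python sketch below imitates that pipeline with regular expressions. The function names are made up and each step is a rough approximation of the real html_strip, standard tokenizer and lowercase filter, not a reimplementation of them.

```python
import re


def html_strip(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def simple_tokenizer(text):
    """Tokenizer: split into runs of word characters, discarding punctuation."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the character filter, then the tokenizer, then the token filter."""
    return lowercase_filter(simple_tokenizer(html_strip(text)))


tokens = analyze("<b>Quick</b> brown fox!")
print(tokens)
```

For the input "<b>Quick</b> brown fox!" the tags are removed first, the remaining text is split into Quick, brown and fox, and the tokens are then lowercased.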

Inverted Index

The results of the analyzers have to be stored somewhere, right? That is where the inverted index comes to the rescue.

It consists of the terms across all documents in an index. So when performing a search query, you are actually searching the inverted index.

Custom analyzers

Below you can see that I have configured an analyzer with different types of filters.

Once I ran the POST, you can see in the output that the HTML has been stripped by the html_strip char filter and that “drinking” has been stemmed to “drink” by the stemmer filter.

Debugging the logs using search query

Explain API

You can use the explain API in the query below to find out why a document does or doesn’t match a search.

GET /product/_doc/1/_explain
{
  "query": {
    "term": { "name": "ramji" }
  }
}

Term level queries vs full text queries

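Term-level queries need exact values to match, because the search term is not analyzed. They are therefore good for searching numbers, dates, keywords and so on.

Full text queries analyze the search text first, so they are good for searching words in text fields.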
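The difference is easiest to see when the indexed text contains capital letters. In the Python sketch below, a toy analyzer (lowercase and split on whitespace, an assumption made for illustration) produces the indexed tokens. The term-style lookup compares the search value as-is, while the match-style lookup analyzes it first. All function names are made up.

```python
def toy_analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def term_query(indexed_tokens, value):
    """Term-level: the value is compared as-is with the indexed tokens."""
    return value in indexed_tokens


def match_query(indexed_tokens, text):
    """Full text: the query text is analyzed first, then any token may match."""
    return any(token in indexed_tokens for token in toy_analyze(text))


indexed_tokens = toy_analyze("Ram Ji")  # what ends up in the inverted index
term_hit = term_query(indexed_tokens, "Ram")
match_hit = match_query(indexed_tokens, "Ram")
print(term_hit, match_hit)
```

Here the term-style lookup for "Ram" finds nothing because only the lowercased token "ram" was indexed, while the match-style lookup succeeds because the query text goes through the same analyzer.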

Searching multiple terms

The query below matches the documents whose tags contain the word “ram” or “divya”.

In “tags.keyword”, tags is a field name. You can substitute your own field name when searching.

GET /product/_doc/_search
{
  "query": {
    "terms": { "tags.keyword": [ "ram", "divya" ] }
  }
}

Search using range

This query searches for documents whose “created” date falls within the specified range.

GET /product/_doc/_search
{
  "query": {
    "range": {
      "created": {
        "gte": "01-01-2019",
        "lte": "09-08-2019",
        "format": "dd-MM-yyyy"
      }
    }
  }
}
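The "format" setting tells Elasticsearch how to read the boundary values, so "09-08-2019" means 9 August 2019. The Python sketch below shows the same check done by hand, under the simplifying assumption that the stored value uses the same format. The function name in_range is made up, and "%d-%m-%Y" is the Python counterpart of the "dd-MM-yyyy" pattern.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # Python counterpart of the "dd-MM-yyyy" pattern


def in_range(value, gte, lte):
    """Return True when gte <= value <= lte, with all three parsed as dates."""
    parsed = datetime.strptime(value, DATE_FORMAT)
    lower = datetime.strptime(gte, DATE_FORMAT)
    upper = datetime.strptime(lte, DATE_FORMAT)
    return lower <= parsed <= upper


inside = in_range("15-03-2019", "01-01-2019", "09-08-2019")
outside = in_range("10-08-2019", "01-01-2019", "09-08-2019")
print(inside, outside)
```

Parsing matters because the values are compared as dates, not as strings: as plain text, "15-03-2019" would sort after "09-08-2019" even though 15 March comes before 9 August.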

Search using wildcards

GET /product/_doc/_search
{
  "query": {
    "wildcard": { "tags.keyword": "ra*" }
  }
}
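In the pattern "ra*", the * stands for any sequence of characters, so the query matches tags that start with "ra". Python's fnmatchcase behaves similarly for * and ?, which makes it handy for a quick illustration. The tag values below are made up, and fnmatchcase also understands [...] character sets, which is a Python feature rather than part of the wildcard query.

```python
from fnmatch import fnmatchcase

# Made-up tag values for illustration.
tags = ["ram", "raja", "divya", "Ram"]

# fnmatchcase is case-sensitive, like an exact keyword comparison.
matches = [tag for tag in tags if fnmatchcase(tag, "ra*")]
print(matches)
```

"Ram" is not matched because the comparison is case-sensitive, which mirrors how a keyword field stores the value exactly as it was indexed.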

From and Size parameters

Pagination of results can be done by using the from and size parameters.

The from parameter defines the offset from the first result that you want to fetch.

The size parameter sets the maximum number of documents to be returned.

GET /_search
{
  "from": 0,
  "size": 100
}
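In terms of a plain list of hits, from and size behave like a slice. The Python sketch below uses made-up hit names and a hypothetical paginate function; the defaults of 0 and 10 mirror the Elasticsearch defaults for from and size.

```python
def paginate(hits, from_=0, size=10):
    """Return the slice of hits starting at offset from_ with at most size items."""
    return hits[from_:from_ + size]


hits = [f"doc-{n}" for n in range(1, 26)]  # 25 made-up hits
page_1 = paginate(hits, from_=0, size=10)
page_3 = paginate(hits, from_=20, size=10)
print(page_1[0], page_3)
```

With 25 hits and a page size of 10, the third page starts at offset 20 and contains only the last five hits, since size is a maximum rather than a guarantee.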

Boolean Search query

Below is an example of a boolean search query. It requires the field “logger_name” to match the value “exception”,

and it scores a document higher if the words “FAILED” and “Exception” are found in the field “message”.

The operator “and” requires both words, “FAILED” and “Exception”, to be present for the should clause to match.

GET myindex/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "logger_name": "exception" } }
      ],
      "should": [
        {
          "multi_match": {
            "query": "FAILED Exception",
            "fields": [ "message" ],
            "operator": "and"
          }
        }
      ]
    }
  }
}

If you want to try out Elasticsearch without installing it, try https://cloud.elastic.co. It is free for 14 days.