As I Learn

Wednesday, January 21, 2015

Monday, January 19, 2015

Elastic Search

Analysis takes text as input and generates the terms that are indexed.

A cluster consists of one or more nodes that share the same cluster name. It automatically elects a master node, and if the master node fails, another node is elected to replace it.

Elastic Search - Keywords

term: A term is an exact value that is indexed in elasticsearch. A term search matches only that exact value.

Example: The terms States, states, StaTes and STATES are not equivalent.

text: Text is ordinary unstructured content, such as a paragraph. It is analyzed, and the resulting terms are indexed in elasticsearch.
analysis: The process of converting text into terms, which are then indexed.

Example: With the default analyzer, the texts 'united states' and 'United States' are both indexed as the terms 'united' and 'states'.
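
To make the difference between text, analysis and terms concrete, here is a minimal sketch in Python. It is not how elasticsearch or Lucene implement analysis; the `analyze` and `build_index` functions are hypothetical helpers that only lowercase the text and split it on non-word characters, which is enough to reproduce the 'united states' example.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text, then split it into terms."""
    return [term for term in re.split(r"\W+", text.lower()) if term]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {1: "united states", 2: "United States"}
index = build_index(docs)

print(analyze("United States"))             # ['united', 'states']
print(sorted(index["states"]))              # [1, 2]
# A term lookup is exact: 'States' was never indexed, only 'states'.
print(sorted(index.get("States", set())))   # []
```

Both documents end up under the same two terms, while an exact lookup for 'States' finds nothing, because only the lowercased term was indexed.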

cluster: A cluster consists of one or more nodes that share the same cluster name. It automatically elects a master node, and if the master node fails, another node is elected to replace it.
node: A node is a running instance of elasticsearch that belongs to a cluster. Any number of nodes can be started on a server, but one node per server is usually recommended. At startup, a node searches for its cluster by name, using multicast or unicast discovery, and joins it.
document: A document is a JSON object stored in an elasticsearch index, similar to a row in a relational database table. Each document is identified by its index, type and id. The original document we indexed is stored in the "_source" field.
index: An index is like a database in a relational database. It has a mapping that defines multiple types (a type is like a table in a relational database).
mapping: A mapping is like a schema definition in a relational database. It can be defined explicitly or generated automatically with default settings. It describes how each field of a type is analyzed.
type: A type is like a table in a relational database. It has a list of fields for its documents.
id (index/type/id): Each document has a unique id, which is auto-generated if not supplied. The combination of index, type and id must be unique.
field: A field is like a column in a table. A document contains a list of fields, or key-value pairs. A value can be scalar data or nested data.
Lucene: Apache Lucene is a free, open source information retrieval software library, originally written in Java.
shard: A shard is a single instance of Apache Lucene. Shards are managed automatically by elasticsearch, not by the user. An index is a logical namespace that points to primary and replica shards. We can specify the number of primary and replica shards for an index.
primary shard: Each document is stored in a single primary shard. When a document is indexed, it is indexed on the primary shard first and then on the replicas of that shard. By default an index has 5 primary shards; this number can be set when the index is created but cannot be changed afterwards.
replica shard: Each primary shard has zero or more replica shards. If a primary shard fails, a replica is promoted to primary, which provides failover. Replicas also increase performance by serving get and search requests. By default each primary shard has one replica, and the number of replicas can be changed dynamically. A replica is never started on the same node as its primary shard.
routing: When a document is indexed, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default the routing value is the document's id or, if the document has a parent, the parent document's id. This ensures that a parent and its child documents are stored on the same shard. The routing value can be overridden at index time or in the mapping.
source field: The "_source" field of a document holds the original JSON document that we indexed.
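
The routing entry above can be sketched in a few lines of Python. This is only an illustration of the idea (hash the routing value, then take it modulo the number of primary shards). `zlib.crc32` is a stand-in chosen for this sketch and is not the hash function elasticsearch uses; `pick_shard` and `routing_for` are hypothetical names.

```python
import zlib


def routing_for(doc_id, parent_id=None, custom_routing=None):
    """Pick the routing value: explicit override, else parent id, else own id."""
    if custom_routing is not None:
        return custom_routing
    return parent_id if parent_id is not None else doc_id


def pick_shard(routing_value, num_primary_shards=5):
    """Hash the routing value onto one primary shard.

    zlib.crc32 is an assumption made for this sketch only.
    """
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % num_primary_shards


parent_shard = pick_shard(routing_for("parent-1"))
child_shard = pick_shard(routing_for("child-7", parent_id="parent-1"))

# The child is routed by its parent's id, so both land on the same shard.
print(parent_shard == child_shard)  # True
```

Because the shard number depends on the number of primary shards, changing that number would send existing ids to different shards. That is one way to see why the primary shard count is fixed once an index exists.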

Reference: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html

Elastic Search - Directory Layout

Type	Description	Default Location	Setting
home	Home of elasticsearch installation.		path.home
bin	Binary scripts including elasticsearch to start a node.	{path.home}/bin
conf	Configuration files including elasticsearch.yml	{path.home}/config	path.conf
data	The location of the data files of each index / shard allocated on the node. Can hold multiple locations.	{path.home}/data	path.data
logs	Log files location.	{path.home}/logs	path.logs
plugins	Plugin files location. Each plugin will be contained in a subdirectory.	{path.home}/plugins	path.plugins

Getting Started with cURL

Download cURL.
On Windows, add the folder containing curl.exe to the PATH environment variable.
Try 'curl www.google.com' in the command prompt.

Getting Started with Elastic Search

Download link
Run

Run bin/elasticsearch on Unix
Run bin/elasticsearch.bat on Windows

Run 'curl -X GET http://localhost:9200/' in the command prompt. If you have not set up cURL yet, set it up first.

This is the default cluster, named 'elasticsearch'.
We should do the following to improve the performance of elasticsearch:

Increase JVM memory.
Increase number of open file descriptors.
Increase virtual memory.
Disable swapping.

Configurations:

By default the process runs in the foreground; use '-d' to run it in the background as a daemon. (Releases before 1.0 ran in the background by default and used '-f' to stay in the foreground.)
JVM options (-X) and elasticsearch settings (-D) can be passed on the command line when starting a node; they override the defaults from JAVA_OPTS or ES_JAVA_OPTS.

Example: 'bin/elasticsearch -Xmx2g -Xms2g -Des.index.store.type=memory --node.name=my-node'
Xmx sets the maximum size of the memory allocation pool (heap) of the Java Virtual Machine (JVM).
Xms sets the initial size of the memory allocation pool (heap) of the Java Virtual Machine (JVM).
-Xmx1024k - 1024 kilobytes
-Xmx512m - 512 MB
-Xmx8g - 8 GB
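
To make the k/m/g suffixes above concrete, here is a small Python sketch that converts such a size into bytes. `jvm_size_to_bytes` is a hypothetical helper written for these notes, not part of the JVM or elasticsearch, and it only handles the three suffixes listed above.

```python
import re

# Multipliers for the suffixes used in the examples above.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def jvm_size_to_bytes(value):
    """Convert a size such as '512m' (the part after -Xmx or -Xms) to bytes."""
    match = re.fullmatch(r"(\d+)([kmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"not a JVM memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(jvm_size_to_bytes("1024k"))  # 1048576
print(jvm_size_to_bytes("512m"))   # 536870912
print(jvm_size_to_bytes("8g"))     # 8589934592
```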

ES_HEAP_SIZE sets the heap memory allocated to the elasticsearch Java process. It assigns the same value to the minimum and maximum heap size; these can also be set separately with ES_MIN_MEM and ES_MAX_MEM, although that is not recommended.

System Configurations:

file descriptors

Set maximum file descriptors.

On Windows, the default limit ('_setmaxstdio') is 512 and the maximum is 2048.
The recommended value is 32k or even 64k.

To check how many file descriptors the process can open, start elasticsearch with '-Des.max-open-files=true'; the number is printed at startup.
Alternatively, use 'curl localhost:9200/_nodes/process?pretty'.
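
As a quick check before starting a node, the limit can also be read from Python on Unix. This sketch uses the standard `resource` module and reports the limit of the current Python process (and so of the shell that launched it), not of a running elasticsearch process. The threshold of 32000 is my reading of "32k" and is an assumption.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors (Unix only)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_recommendation(limit, recommended=32000):
    """True if the limit is unlimited or at least the recommended value."""
    return limit == resource.RLIM_INFINITY or limit >= recommended


soft, hard = open_file_limits()
print("soft limit:", soft, "hard limit:", hard)
print("soft limit is high enough:", meets_recommendation(soft))
```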

virtual memory

The operating system's default limit on mmap counts is likely to be too low for elasticsearch; it can be raised.

On Linux, run 'sysctl -w vm.max_map_count=262144' as root.

To set it permanently, update 'vm.max_map_count=262144' in '/etc/sysctl.conf'.
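
To see whether the setting needs to be raised, the current value can be read from '/proc' on Linux. The sketch below is a convenience written for these notes; `read_max_map_count` and `needs_raise` are hypothetical helper names, and the function returns None on systems where the file does not exist or cannot be read.

```python
from pathlib import Path

# Value taken from the sysctl command above.
RECOMMENDED_MAX_MAP_COUNT = 262144


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the current vm.max_map_count, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def needs_raise(current, recommended=RECOMMENDED_MAX_MAP_COUNT):
    """True if a known value is below the recommended one."""
    return current is not None and current < recommended


current = read_max_map_count()
print("vm.max_map_count:", current)
print("should be raised:", needs_raise(current))
```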

memory settings

swap

By default Linux swaps out memory that is not in use. Swapping out the elasticsearch process results in poor node stability, so swapping should be disabled.
There are three options for swap:

Disable swap completely.

sudo swapoff -a
Permanent Setting: comment out lines for 'swap' in '/etc/fstab'

Set vm.swappiness = 0. This reduces the kernel's tendency to swap, but still allows swapping under emergency conditions.
Use mlockall, which locks the process address space into RAM and prevents it from being swapped out.

Set 'bootstrap.mlockall: true' in 'config/elasticsearch.yml'.

elasticsearch Settings:

Configuration files are found under 'ES_HOME/config'.

'elasticsearch.yml' for configuring the different elasticsearch modules.
'logging.yml' for configuring elasticsearch logging.

Paths for logs and data (path)

Usage: path.logs = 'path for logs', path.data = 'path for data'
Usage in commands: "-Des.path.logs=/var/log/elasticsearch" (no spaces around '=')

path:
  logs: /var/log/elasticsearch
  data: /var/data/elasticsearch

Cluster Name (cluster)

Usage: cluster.name = 'name of your cluster'
Usage in commands: "-Des.cluster.name='name of your cluster'"
cluster:
  name: <NAME OF YOUR CLUSTER>

Node Name (node): the name of the node. By default a Marvel character name is assigned at random.

Usage: node.name = 'name of your node'
Usage in commands: "-Des.node.name='name of your node'"

```
node:
  name: <NAME OF YOUR NODE>
```

The configuration format is YAML by default. JSON can be used instead, in which case the node name setting becomes:

{
    "node" : {
        "name" : "NAME OF YOUR NODE"
    }
}
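
The YAML and JSON forms above express the same nested structure as the dotted names used in the 'Usage' lines (node.name, cluster.name, path.logs). As an illustration, this Python sketch expands dotted names into that nested form and prints it as JSON. `nest_settings` is a hypothetical helper, not part of elasticsearch.

```python
import json


def nest_settings(flat):
    """Expand dotted setting names into nested dictionaries."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create the intermediate level on first use.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


flat = {
    "node.name": "NAME OF YOUR NODE",
    "path.logs": "/var/log/elasticsearch",
    "path.data": "/var/data/elasticsearch",
}
print(json.dumps(nest_settings(flat), indent=4))
```

The sketch assumes no setting name is both a value and a prefix of another name (for example "a" and "a.b" together), which it does not try to handle.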

An external configuration file can be specified with '-Des.config=/path/to/config/file'.

index settings

Indices can be stored in memory or on the file system. By default they are stored on the file system; memory storage can be selected with a YAML or JSON parameter, or on the command line.

Usage in commands: "-Des.index.store.type=memory"
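
All the 'Usage in commands' lines in these notes follow the same pattern: the setting name prefixed with '-Des.', then '=', then the value, with no spaces in between. A small Python sketch of that pattern, assuming a hypothetical helper `es_flags` and the setting values already used in the earlier example command:

```python
import shlex


def es_flags(settings):
    """Render settings as -Des.<name>=<value> command-line arguments."""
    return [f"-Des.{name}={value}" for name, value in settings.items()]


settings = {"index.store.type": "memory", "node.name": "my-node"}
command = ["bin/elasticsearch"] + es_flags(settings)

# shlex.join quotes arguments where the shell would need it.
print(shlex.join(command))
```

Keeping the arguments as a list (rather than one string) means a value that contains spaces stays a single argument if the list is later passed to `subprocess.run`.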

logging

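Elasticsearch uses log4j for logging, and the logging configuration supports the yaml, json and properties formats. If multiple configuration files are present, they are all merged, provided each file name starts with the prefix and ends with one of the suffixes below.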
Prefix: logging.
Suffix: .yml, .yaml, .json, .properties
Folder contains required java packages.

multiple data

path.data: /mnt/first,/mnt/second
path.data: ["/mnt/first", "/mnt/second"]

Tuesday, January 6, 2015

Implement an algorithm to determine if a string has all unique characters.

All of these questions are available on the internet, and I have used Cracking the Coding Interview as a reference for a few of them. The implementations might differ from the book's, as only the questions are inspired by it. Please suggest any algorithms with a better run time.
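
One possible approach to the unique-characters question, sketched in Python: walk the string once and remember every character seen so far in a set. This is my own sketch, not the book's solution. It runs in O(n) time and uses extra space proportional to the number of distinct characters.

```python
def has_unique_chars(text):
    """Return True if no character occurs more than once in text."""
    seen = set()
    for ch in text:
        if ch in seen:
            # Second occurrence found, so the characters are not all unique.
            return False
        seen.add(ch)
    return True


print(has_unique_chars("search"))   # True
print(has_unique_chars("elastic"))  # True
print(has_unique_chars("states"))   # False
```

The comparison is case-sensitive, so 'S' and 's' count as different characters. Without the extra set, the string could be sorted first and adjacent characters compared, at O(n log n) time.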