Monday, January 19, 2015

Elastic Search



Analysis takes text as input and produces the terms that are stored in the index.
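As a rough illustration of what analysis does, here is a minimal Python sketch. It is not the analyzer elasticsearch actually uses; it only assumes a toy analyzer that lowercases the text and splits it into words, and the function names `analyze` and `build_inverted_index` are made up for this example.

```python
import re


def analyze(text):
    """Toy analyzer (an assumption, not elasticsearch's own): lowercase, then split into words."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {1: "United States", 2: "united states of America"}
index = build_inverted_index(docs)
print(analyze("United States"))  # ['united', 'states']
print(index["states"])           # [1, 2]
```

Both documents end up under the same terms because the text is normalized before it is indexed.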



A cluster consists of one or more nodes that share the same cluster name. The cluster automatically elects a master node, and if the master node fails, another node is elected to replace it.


Elastic Search - Keywords


  • term: A term is an exact value that is indexed in elasticsearch. A term query matches only that exact value.
    • Example: A term query for 'States' does not match 'states', 'StaTes' or 'STATES'.
  • text: Text is ordinary unstructured text. It is analyzed, and the resulting terms are indexed in elasticsearch.
  • analysis: The process of converting text into terms, which are then indexed.
    • Example: The texts 'united states' and 'United States' are both indexed as the terms 'united' and 'states'.
  • cluster: A cluster consists of one or more nodes that share the same cluster name. The cluster automatically elects a master node, and if the master node fails, another node is elected to replace it.
  • node: A node is a running instance of elasticsearch that belongs to a cluster. Any number of nodes can be started on a server, but one node per server is usually recommended. As soon as a node starts, it looks for its cluster by name and joins it. It uses multicast or unicast discovery to find the cluster.
  • document: A document is a JSON object stored in an elasticsearch index, similar to a row in a relational database table. Each document is stored in an index and has a type and an id. The original JSON document that was indexed is stored in the "_source" field.
  • index: An index is like a database in a relational database system. It has a mapping that defines multiple types (each type is like a table in a relational database).
  • mapping: A mapping is like a schema definition in a relational database. It can be defined explicitly or generated automatically with default settings. It describes how the fields of each type are analyzed.
  • type: A type is like a table in a relational database. It has a list of fields that can be specified for its documents.
  • id (index/type/id): Each document has a unique id, which is auto-generated if not supplied. The combination of index, type and id identifies a document.
  • field: A field is like a column in a table. A document contains a list of fields, or key-value pairs. A value can be scalar data or nested data.
  • Lucene: Apache Lucene is a free, open source information retrieval software library, originally written in Java.
  • shard: A shard is a single instance of Apache Lucene. Shards are managed automatically by elasticsearch, not by the user. An index is a logical namespace that points to primary and replica shards. We can specify the number of primary and replica shards for an index.
  • primary shard: Each document is stored in a single primary shard. When a document is indexed, it is indexed on the primary shard first and then on the replica shards. By default an index has 5 primary shards. This number can be set higher or lower when the index is created, but it cannot be changed afterwards.
  • replica shard: Each primary shard has zero or more replica shards. When a primary shard fails, a replica is promoted to primary, which provides failover. Replica shards also increase performance, because they can handle get and search requests. By default each primary shard has one replica, and the number of replicas can be changed dynamically. A replica shard is never started on the same node as its primary shard.
  • routing: When a document is indexed, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default the routing value is the document id, or the parent document id if the document has a parent. This ensures that a parent and its child documents are stored on the same shard. The routing value can be overridden at index time or in the mapping.
  • source field: The "_source" field of a document, which holds the original JSON document that was indexed.
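
The routing entry above can be sketched in a few lines of Python. This is only an illustration of the idea "hash the routing value, then take it modulo the number of primary shards". The hash used here (`zlib.crc32`) is a stand-in chosen for the example and is not the hash function elasticsearch uses; `routing_value_for` and `pick_shard` are hypothetical names.

```python
import zlib


def routing_value_for(doc_id, parent_id=None, routing=None):
    """Pick the routing value: explicit routing, else parent id, else the document id."""
    if routing is not None:
        return routing
    if parent_id is not None:
        return parent_id
    return doc_id


def pick_shard(routing_value, num_primary_shards):
    """Hash the routing value and map it onto one of the primary shards.

    crc32 is only a stand-in hash for this sketch.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


parent_shard = pick_shard(routing_value_for("parent-1"), 5)
child_shard = pick_shard(routing_value_for("child-7", parent_id="parent-1"), 5)
print(parent_shard == child_shard)  # True: parent and child land on the same shard
```

The modulo step also shows why the number of primary shards is fixed once the index exists: changing it would send existing routing values to different shards.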

Reference: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html

Elastic Search - Directory Layout


Type | Description | Default Location | Setting
home | Home of elasticsearch installation. | | path.home
bin | Binary scripts including elasticsearch to start a node. | {path.home}/bin |
conf | Configuration files including elasticsearch.yml. | {path.home}/config | path.conf
data | The location of the data files of each index / shard allocated on the node. Can hold multiple locations. | {path.home}/data | path.data
logs | Log files location. | {path.home}/logs | path.logs
plugins | Plugin files location. Each plugin will be contained in a subdirectory. | {path.home}/plugins | path.plugins
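
To make the relationship between path.home and the other locations concrete, here is a small Python sketch that derives the default locations from the home directory and lets individual settings be overridden. It does not read any real elasticsearch configuration; `resolve_paths` and the example directory `/opt/elasticsearch` are assumptions made for illustration.

```python
import posixpath

# Default sub-directory for each path setting, relative to path.home.
DEFAULT_SUBDIRS = {
    "path.conf": "config",
    "path.data": "data",
    "path.logs": "logs",
    "path.plugins": "plugins",
}


def resolve_paths(path_home, overrides=None):
    """Return every path setting, using {path.home}/<subdir> unless overridden."""
    overrides = overrides or {}
    paths = {"path.home": path_home}
    for setting, subdir in DEFAULT_SUBDIRS.items():
        paths[setting] = overrides.get(setting, posixpath.join(path_home, subdir))
    return paths


paths = resolve_paths("/opt/elasticsearch", {"path.logs": "/var/log/elasticsearch"})
print(paths["path.data"])  # /opt/elasticsearch/data
print(paths["path.logs"])  # /var/log/elasticsearch
```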

Getting Started with cURL


  • Download cURL.
  • On Windows, add the folder that contains curl.exe to the PATH environment variable.
  • Try 'curl www.google.com' in command prompt.

Getting Started with Elastic Search


  • Download link
  • Run
    • Run bin/elasticsearch on Unix
    • Run bin/elasticsearch.bat on Windows 
  • Run 'curl -X GET http://localhost:9200/' in the command prompt. If you have not set up cURL yet, set it up first.
  • The node joins the default cluster, which is named 'elasticsearch'.
  • We should do the following to improve the performance of elasticsearch:
    • Increase JVM memory.
    • Increase number of open file descriptors.
    • Increase virtual memory.
    • Disable swapping.
Configurations:
  • By default the process runs in the foreground. Pass '-d' to run it in the background. In older releases the '-f' option was used to keep it in the foreground.
  • JVM options (-X) and elasticsearch settings (-D) can be passed on the command line when starting a node. They override the default JAVA_OPTS or ES_JAVA_OPTS configuration.
    • Example: 'bin/elasticsearch -Xmx2g -Xms2g -Des.index.store.type=memory --node.name=my-node'
    • -Xmx sets the maximum size of the memory allocation pool (heap) of the Java Virtual Machine (JVM).
    • -Xms sets the initial size of the memory allocation pool (heap) of the JVM.
    • -Xmx1024k - 1024 kilobytes
    • -Xmx512m - 512 MB
    • -Xmx8g - 8 GB
  • The ES_HEAP_SIZE environment variable sets the heap memory allocated to the elasticsearch Java process. It assigns the same value to both the minimum and the maximum. The two values can also be set separately using ES_MIN_MEM and ES_MAX_MEM.
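
The size suffixes used with -Xmx and -Xms can be checked with a short Python sketch. The helper names `parse_jvm_size` and `heap_flags` are hypothetical and only illustrate the arithmetic (k, m and g are multiples of 1024) and the idea of setting the initial and maximum heap to the same value.

```python
# JVM size suffixes are powers of 1024.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_jvm_size(value):
    """Convert a JVM size such as '512m' or '8g' into a number of bytes."""
    value = value.strip().lower()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)  # no suffix: the value is already in bytes


def heap_flags(heap_size):
    """Build -Xms/-Xmx flags that give the initial and maximum heap the same size."""
    parse_jvm_size(heap_size)  # raises ValueError if the size is malformed
    return ["-Xms" + heap_size, "-Xmx" + heap_size]


print(parse_jvm_size("1024k"))  # 1048576
print(parse_jvm_size("512m"))   # 536870912
print(heap_flags("2g"))         # ['-Xms2g', '-Xmx2g']
```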

System Configurations:
  • file descriptors
    • Set maximum file descriptors.
      • On Windows, the limit controlled by '_setmaxstdio' is 512 by default and can be raised to a maximum of 2048.
      • The recommended limit is 32k or even 64k.
    • To view the number of file descriptors available to the process, start the node with the parameter '-Des.max-open-files=true', which prints the value at startup.
    • Alternatively, use 'curl localhost:9200/_nodes/process?pretty'.
  • virtual memory
    • The default operating system limit on mmap counts is likely to be too low and should be increased.
      • Run 'sysctl -w vm.max_map_count=262144' on Linux.
        • To make the change permanent, set 'vm.max_map_count=262144' in the '/etc/sysctl.conf' file.
  • memory settings
    • swap
      • By default, Linux swaps out the memory of processes that are not in active use. If the elasticsearch process is swapped out, node stability suffers, so swapping should be disabled.
      • Three options for swap
        • Disable swap completely.
          • sudo swapoff -a
          • Permanent Setting: comment out lines for 'swap' in '/etc/fstab'
        • Set 'vm.swappiness' to 0. The kernel then avoids swapping under normal circumstances but can still swap under emergency conditions.
        • Use mlockall, which locks the process address space into RAM and prevents it from being swapped out.
          • Set 'bootstrap.mlockall: true' in 'config/elasticsearch.yml'.
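
The file descriptor and mmap recommendations above can be turned into a quick check. The Python sketch below is only an assumption about how one might verify a Linux host: it reads the soft open-files limit through the standard `resource` module and the mmap limit from '/proc/sys/vm/max_map_count'. The function names are hypothetical, and "32k" is interpreted here as 32 * 1024.

```python
RECOMMENDED_OPEN_FILES = 32 * 1024   # "32k" from the list above (interpretation)
RECOMMENDED_MAX_MAP_COUNT = 262144   # value used with sysctl above


def check_limits(open_files, max_map_count):
    """Return a list of human-readable problems; an empty list means both limits look fine."""
    problems = []
    if open_files < RECOMMENDED_OPEN_FILES:
        problems.append("open files limit %s is below %s" % (open_files, RECOMMENDED_OPEN_FILES))
    if max_map_count < RECOMMENDED_MAX_MAP_COUNT:
        problems.append("vm.max_map_count %s is below %s" % (max_map_count, RECOMMENDED_MAX_MAP_COUNT))
    return problems


def read_current_limits():
    """Read the current limits of this process (Linux only)."""
    import resource  # Unix-only module, imported lazily on purpose

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        soft = float("inf")
    with open("/proc/sys/vm/max_map_count") as handle:
        max_map_count = int(handle.read().strip())
    return soft, max_map_count


print(check_limits(1024, 65530))  # two problems reported
```

On a Linux host, `check_limits(*read_current_limits())` would report on the limits of the current shell session, which are not necessarily the ones the elasticsearch process runs with.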
elasticsearch Settings:
  • Configuration files are found under 'ES_HOME/config'.
    • 'elasticsearch.yml' configures the different elasticsearch modules.
    • 'logging.yml' configures elasticsearch logging.
  • Paths for logs and data (path)
    • Usage: path.logs = 'path for logs', path.data = 'path for data'
    • Usage in commands: "-Des.path.logs=/var/log/elasticsearch" (no spaces around '=')
    • In elasticsearch.yml (nested form):
        path:
          logs: /var/log/elasticsearch
          data: /var/data/elasticsearch
  • Cluster Name (cluster)
    • Usage: cluster.name = 'name of your cluster'
    • Usage in commands: "-Des.cluster.name='name of your cluster'"
    • In elasticsearch.yml (nested form):
        cluster:
          name: <NAME OF YOUR CLUSTER>
  • Node Name (node). If no name is set, the node is assigned a random Marvel character name at startup.
    • Usage: node.name = 'name of your node'
    • Usage in commands: "-Des.node.name='name of your node'"
    • In elasticsearch.yml (nested form):
        node:
          name: <NAME OF YOUR NODE>
  • The configuration uses YAML format by default. JSON can be used instead if necessary, in which case the node name is written as:
    • {
          "node" : {
              "name" : "NAME OF YOUR NODE"
          }
      }
  • If an external configuration file is used, its location can be passed using '-Des.config=/path/to/config/file'.
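
The same setting therefore shows up in three shapes: a dotted name (node.name), a nested YAML or JSON structure, and a '-Des.' command-line parameter. The Python sketch below shows how the shapes relate. It is not part of elasticsearch, and `nest_settings` and `to_command_line` are names invented for this example.

```python
import json


def nest_settings(flat):
    """Turn dotted keys such as 'node.name' into the nested form used in YAML/JSON."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def to_command_line(flat):
    """Render the same settings as -Des.<name>=<value> parameters (no spaces around '=')."""
    return ["-Des.%s=%s" % (key, value) for key, value in sorted(flat.items())]


settings = {"node.name": "my-node", "cluster.name": "my-cluster"}
print(json.dumps(nest_settings(settings), indent=4, sort_keys=True))
print(to_command_line(settings))
```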
index settings
  • Indices can be stored in memory or on the file system. By default they are file based. They can be made memory based by passing a YAML or JSON parameter.
    • Usage in commands: "-Des.index.store.type=memory"
logging
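  • Elasticsearch uses log4j for logging, and the logging configuration can be written in YAML, JSON or properties format. If multiple logging configuration files are present, they are merged, provided each file name has the prefix and one of the suffixes listed below.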
  • Prefix: logging.
  • Suffix: .yml, .yaml, .json, .properties
  • Folder contains required java packages. 
multiple data paths
  • path.data: /mnt/first,/mnt/second
  • path.data: ["/mnt/first", "/mnt/second"]

Tuesday, January 6, 2015

Implement an algorithm to determine if a string has all unique characters.

All these questions are available on the internet, and I have used Cracking the Coding Interview as a reference for a few of them. The implementations may differ from the book, since only the questions are taken from it. Please suggest better algorithms with a better running time if you know of any.
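
For the unique-characters question, one possible approach (a sketch, not necessarily the implementation from the book) is to remember the characters seen so far in a set and stop at the first repeat. This runs in time proportional to the length of the string and uses extra memory for the set. The function name `has_all_unique_chars` is my own.

```python
def has_all_unique_chars(text):
    """Return True if no character occurs more than once in text."""
    seen = set()
    for ch in text:
        if ch in seen:
            return False  # first repeated character found
        seen.add(ch)
    return True


print(has_all_unique_chars("search"))   # True
print(has_all_unique_chars("elastic"))  # True
print(has_all_unique_chars("states"))   # False ('s' and 't' repeat)
```

The check is case sensitive, so 'S' and 's' count as different characters.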