Elasticsearch Core Knowledge, Part 1: Index CRUD, Mapping, and Query - CFANZ编程社区

1. Index CRUD

GET _search
{
  "query": {
    "match_all": {}
  }
} 
# Create the index
PUT /product
# Search the index
GET /product/_search
# Add document 1
PUT /product/_doc/1
{
  "name":"xiaomi phone",
  "desc":"shouji zhong de zhandouji",
  "price":1999,
  "tags":[
    "xingjiabi",
    "fashao",
    "buka"
  ]
}
# Add document 2
PUT /product/_doc/2
{
  "name":"xiaomi nfc phone",
  "desc":"shouji zhong de zhandouji",
  "price":2999,
  "tags":[
    "xingjiabi",
    "fashao",
    "buka"
  ]
}
# Update, option 1 (typeless endpoint, 7.x and later)
POST /product/_update/1
{
  "doc":{
    "price":1366
  }
}
# Update, option 2 (older typed endpoint, deprecated in 7.x)
POST /product/_doc/1/_update
{
  "doc":{
    "price":13656
  }
}
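
The requests above cover create and update. As a minimal sketch of the remaining read and delete operations, assuming the product index and the two documents created above, the following requests could be used:

```
# Read document 1 by ID
GET /product/_doc/1

# Delete document 2
DELETE /product/_doc/2
```

Reading by ID returns the stored _source together with the document version, which is a quick way to confirm that a partial update was applied.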

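2. Mapping

Concept: mapping is the process of defining how a document and the fields it contains are stored and indexed.

There are two ways to create a mapping: dynamic mapping (also called automatic mapping) and explicit mapping (also called static or manual mapping). This part also covers mapping data types and mapping parameters.

1 Concept:

A mapping in Elasticsearch is roughly comparable to a table schema in a relational database. In MySQL, the table schema holds the column names, the column types, index information, and so on. A mapping likewise holds a set of properties, such as the field name, the field type, the analyzer used by the field, whether the field takes part in scoring, and whether the field is indexed.

In addition, a single field in Elasticsearch can be mapped with more than one type. Analyzers, scoring, and related concepts are covered in later lessons.

2 Viewing a mapping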

GET /index/_mappings
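
In the request above, index stands for an index name. As a small sketch, assuming the product index from section 1, the whole mapping or the mapping of one field could be inspected like this:

```
# Mapping of the whole index
GET /product/_mapping

# Mapping of a single field
GET /product/_mapping/field/name
```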

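3 Elasticsearch data types

① Common types

1) Numeric types:

long integer short byte double float half_float scaled_float unsigned_long

2) Keyword family:

keyword: suited to indexing structured fields; it can be used for filtering, sorting, and aggregations. A keyword field can only be found by its exact value. IDs should use keyword.

constant_keyword: a keyword field that always contains the same value.

wildcard: optimizes log lines and similar keyword values for grep-like wildcard queries.

Keyword fields are typically used for sorting, aggregations, and term-level queries such as term.

3) Dates: date and date_nanos

4) alias: defines an alias for an existing field.

5) Binary: binary

6) Range types: integer_range, float_range, long_range, double_range, date_range

7) text: a field that needs full-text search, such as the body of an email or a product description, should use the text type. The content of a text field is analyzed: before the inverted index is built, the analyzer splits the string into individual terms. text fields are not used for sorting and are rarely used for aggregations.

(Why sorting and aggregating on text is off by default: text fields have no forward index, so these operations rely on fielddata, which can consume a large amount of heap space, especially when loading high-cardinality text fields. Once fielddata has been loaded onto the heap, it stays there for the lifetime of the segment. Loading fielddata is also an expensive process that can cause latency for users.

This is why fielddata is disabled by default.)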

② Object and relational types:

1) object: for a single JSON object

2) nested: for arrays of JSON objects

3) flattened: allows an entire JSON object to be indexed as a single field.
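
To illustrate why nested exists, here is a minimal sketch. The index name nested_demo and its fields are hypothetical and chosen only for illustration. Each object in a nested array is indexed separately, so a query that combines conditions only matches when a single object satisfies all of them:

```
PUT /nested_demo
{
  "mappings": {
    "properties": {
      "partlist": {
        "type": "nested",
        "properties": {
          "name": { "type": "keyword" },
          "desc": { "type": "text" }
        }
      }
    }
  }
}

PUT /nested_demo/_doc/1?refresh
{
  "partlist": [
    { "name": "adapter", "desc": "5V 2A" },
    { "name": "erji", "desc": "boom" }
  ]
}

# Expected to return no hits: "adapter" and "boom" belong to different objects
GET /nested_demo/_search
{
  "query": {
    "nested": {
      "path": "partlist",
      "query": {
        "bool": {
          "must": [
            { "term": { "partlist.name": "adapter" } },
            { "match": { "partlist.desc": "boom" } }
          ]
        }
      }
    }
  }
}
```

With the object type instead, the values of all objects are flattened into the parent document, so the same pair of conditions would be expected to match.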

③ Spatial types:

1) geo_point: latitude/longitude points

2) geo_shape: for complex shapes such as polygons

3) point: Cartesian points

4) shape: arbitrary Cartesian geometries

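④ Specialized types:

1) IP addresses: ip, for IPv4 and IPv6 addresses

2) completion: provides auto-complete suggestions

3) token_count: counts the number of tokens in a string

4) murmur3: computes a hash of the value at index time and stores it in the index

5) annotated_text: indexes text that contains special markup (typically used to identify named entities)

6) percolator: indexes queries written in the Query DSL

7) join: defines a parent/child relationship between documents in the same index

8) rank_feature and rank_features: record numeric features to boost hits at query time.

9) dense_vector: records dense vectors of float values.

10) sparse_vector: records sparse vectors of float values.

11) search_as_you_type: a text field optimized for queries that provide as-you-type completion

12) histogram: pre-aggregated numeric values, used for percentile aggregations.

13) constant_keyword: a specialization of keyword for the case where all documents have the same value.

⑤ array: Elasticsearch has no dedicated array data type. By default, any field can contain zero or more values; however, all values in an array must have the same data type.

⑥ Newer types:

1) date_nanos: a date with nanosecond resolution

2) features：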

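4 The two kinds of mapping

Dynamic field mapping:

integer => long

floating-point number => float

true or false => boolean

date => date

array => depends on the first non-null value in the array

object => object

string => if it is not detected as a number or a date, it is mapped as text with a keyword sub-field

Every type other than those listed above must be mapped explicitly, that is, specified by hand, because Elasticsearch cannot detect it automatically.

Explicit field mapping: manual mapping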

PUT /product
{
  "mappings": {
    "properties": {
      "field": {
        "mapping_parameter": "parameter_value"
      }
    }
  }
}

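5 Mapping parameters

① index: whether an inverted index is built for the field. Defaults to true. A field that is not indexed cannot be searched, but it is still returned in the _source metadata.

② analyzer: the analyzer to use (character filters, tokenizer, token filters).

③ boost: the weight of the field in relevance scoring. Defaults to 1.

④ coerce: whether type coercion is allowed. With true, the string "1" is converted to the number 1; with false, the string "1" is rejected for a numeric field.

⑤ copy_to: copies the values of several fields into a group field, which can then be queried as a single field.

⑥ doc_values: improves sorting and aggregation performance. Defaults to true. If you are sure the field will never be sorted on, aggregated on, or accessed from a script, doc values can be disabled to save disk space (not supported on text and annotated_text fields).

⑦ dynamic: controls whether new fields can be added dynamically.

1) true: newly detected fields are added to the mapping (default).

2) false: newly detected fields are ignored. They are not indexed and therefore not searchable, but they still appear in the _source of returned hits. They are not added to the mapping; new fields must be added explicitly.

3) strict: if a new field is detected, an exception is thrown and the document is rejected. New fields must be added to the mapping explicitly.

⑧ eager_global_ordinals: used on fields that are aggregated on, to improve aggregation performance.

1) Frozen indices: some indices are used so often that they are kept in memory; others are used so rarely that it is preferable to rebuild their data structures on each use and discard them afterwards. Frozen indices are hit rarely and are not suited to a high search load. Their data is not kept in memory, so they use far less heap than regular indices. Frozen indices are read-only, and a request may take seconds or even minutes. eager_global_ordinals does not apply to frozen indices.

⑨ enabled: whether the field is parsed and indexed at all. It can be set on an object field or on the whole mapping. With it turned off, the data is not searchable but can still be retrieved and is shown in the _source metadata. Use it with care: the setting cannot be changed afterwards.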

PUT my_index
{
  "mappings": {
    "enabled": false
  }
}

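⑩ fielddata: a query-time, in-memory data structure. The first time a text field is used for aggregation, sorting, or in a script, fielddata is built from the field's inverted index and held on the heap.

⑪ fields: defines multi-fields, so that one field can be indexed in several ways for different purposes (full-text search, or aggregation, analysis, and sorting).

⑫ format: the date format.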

"date": {
  "type": "date",
  "format": "yyyy-MM-dd"
}

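⑬ ignore_above: strings longer than this length are ignored (not indexed).

⑭ ignore_malformed: ignores values of the wrong type.

⑮ index_options: controls which information is added to the inverted index for search and highlighting. Only used on text fields.

⑯ index_phrases: speeds up exact phrase queries, at the cost of more disk space.

⑰ index_prefixes: prefix search.

1) min_chars: the minimum prefix length; must be greater than 0; defaults to 2 (inclusive).

2) max_chars: the maximum prefix length; must be less than 20; defaults to 5 (inclusive).

⑱ meta: additional metadata attached to the field.

⑲ normalizer：

⑳ norms: whether length-normalization factors are stored for scoring (should be disabled on fields used only for filtering or aggregations).

21 null_value: sets a default value that replaces explicit null values.

22 position_increment_gap：

23 properties: besides the top level of a mapping, also used to define the sub-fields of an object field.

24 search_analyzer: sets a separate analyzer to use at search time.

25 similarity: sets the relevance algorithm for the field; BM25, classic (TF/IDF), and boolean are supported.

26 store: whether the field value is stored separately so that it can be retrieved on its own, apart from _source.

27 term_vector: an operations-oriented parameter (whether term vectors are stored for the field).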
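
Several of the parameters above have no example in the code section that follows. As a minimal sketch of null_value and ignore_above, using a hypothetical params_demo index and status field, the behaviour could be explored like this:

```
PUT /params_demo
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "null_value": "NULL",
        "ignore_above": 10
      }
    }
  }
}

# An explicit null is indexed as the string "NULL"
PUT /params_demo/_doc/1?refresh
{
  "status": null
}

# The value is longer than 10 characters, so it is kept in _source but not indexed
PUT /params_demo/_doc/2?refresh
{
  "status": "a-very-long-status-value"
}

# Expected to match document 1 only
GET /params_demo/_search
{
  "query": {
    "term": { "status": "NULL" }
  }
}
```

Note that null_value only changes what is indexed; the _source of document 1 still contains the original null.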

#code

#Dynamic mapping
DELETE product_mapping
PUT /product_mapping/_doc/1
{
  "name": "xiaomi phone",
  "desc": "shouji zhong de zhandouji",
  "count": 123456,
  "price": 123.123,
  "date": "2020-05-20",
  "isdel": false,
  "tags": [
    "xingjiabi",
    "fashao",
    "buka"
  ]
}
# View the generated mapping; the response is shown below
GET product_mapping/_mapping
{
  "product_mapping" : {
    "mappings" : {
      "properties" : {
        "count" : {
          "type" : "long"
        },
        "date" : {
          "type" : "date"
        },
        "desc" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "isdel" : {
          "type" : "boolean"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "price" : {
          "type" : "float"
        },
        "tags" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

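# Create a mapping by hand (a field's mapping can only be created, not modified)
# Syntax
GET product/_mapping
# The index must not exist yet, so delete the one created earlier
DELETE /product
PUT /product
{
  "mappings": {
    "properties": {
      "date": {
        "type": "text"
      }
    }
  }
}

GET product/_mapping
#1 index

# Example (delete the index again so it can be recreated with the full mapping)
DELETE /product
PUT /product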
{
  "mappings": {
    "properties": {
      "date": {
        "type": "text"
      },
      "desc": {
        "type": "text",
        "analyzer": "english"
      },
      "name": {
        "type": "text",
        "index": "false"
      },
      "price": {
        "type": "long"
      },
      "tags": {
        "type": "text",
        "index": "true"
      },
      "parts": {
        "type": "object"
      },
      "partlist": {
        "type": "nested"
      }
    }
  }
}
# Insert data
GET product/_mapping
PUT /product/_doc/1
{
  "name": "xiaomi phone",
  "desc": "shouji zhong de zhandouji",
  "count": 123456,
  "price": 3999,
  "date": "2020-05-20",
  "isdel": false,
  "tags": [
    "xingjiabi",
    "fashao",
    "buka"
  ],
  "parts": {
    "name": "adapter",
    "desc": "5V 2A"
  },
  "partlist": [
    {
      "name": "adapter",
      "desc": "5V 2A"
    },
    {
      "name": "USB-C",
      "desc": "5V 2A 1.5m"
    },
    {
      "name": "erji",
      "desc": "boom"
    }
  ]
}
# View all documents
GET /product/_search
{
  "query": {
    "match_all": {}
  }
}
# Verify: name has "index" set to false, so searching on it is expected to fail
GET /product/_search
{
  "query": {
    "match": {
      "name": "xiaomi"
    }
  }
}

#copy_to
PUT copy_to
{
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "copy_to": "field_all" 
      },
      "field2": {
        "type": "text",
        "copy_to": "field_all" 
      },
      "field_all": {
        "type": "text"
      }
    }
  }
}

PUT copy_to/_doc/1
{
  "field1": "field1",
  "field2": "field2"
}
GET copy_to/_search
GET copy_to/_search
{
  "query": {
    "match": {
      "field_all": { 
        "query": "field1 field2"
      }
    }
  }
}

# coerce: whether type coercion is allowed
PUT coerce
{
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer"
      },
      "number_two": {
        "type": "integer",
        "coerce": false
      }
    }
  }
}
PUT coerce/_doc/1
{
  "number_one": "10" 
}
# Rejected, because coerce is set to false on number_two
PUT coerce/_doc/2
{
  "number_two": "10" 
}  

DELETE coerce
PUT coerce
{
  "settings": {
    "index.mapping.coerce": false
  },
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer",
        "coerce": true
      },
      "number_two": {
        "type": "integer"
      }
    }
  }
}
PUT coerce/_doc/1
{ 
  "number_one": "10" 
} 
# Rejected, because number_two inherits the index-level coerce: false
PUT coerce/_doc/2
{
  "number_two": "10" 
  
} 

PUT /product/_mapping
{
  "properties": {
    "date": {
      "type": "text"
    }
  }
}

# ⑦ dynamic
PUT dynamic
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "user": {
        "properties": {
          "date": {
            "type": "text"
          },
          "desc": {
            "type": "text",
            "analyzer": "english"
          },
          "name": {
            "type": "text",
            "index": "false"
          },
          "price": {
            "type": "long"
          }
        }
      }
    }
  }
}
PUT /dynamic/_mapping
{
  "properties": {
    "date": {
      "type": "text"
    }
  }
}
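
As a rough sketch of what "dynamic": false means in practice, assuming the dynamic index and the mapping update above, a document with an unmapped field (the field name color is hypothetical) could be indexed and then searched:

```
# "color" is not in the mapping; with dynamic: false it is ignored, not rejected
PUT /dynamic/_doc/1?refresh
{
  "date": "2020-05-20",
  "color": "red"
}

# Expected to return no hits: color was never indexed
GET /dynamic/_search
{
  "query": {
    "match": { "color": "red" }
  }
}

# The field is still part of the stored _source
GET /dynamic/_doc/1
```

Had the index been created with "dynamic": "strict" instead, the PUT itself would be expected to fail with a mapping exception.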

# ⑪ fields
GET /product/_mapping
# Give city a keyword sub-field
PUT fields_test
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": { 
            "type":  "keyword"
          }
        }
      }
    }
  }
}

PUT fields_test/_doc/1
{
  "city": "New York"
}

PUT fields_test/_doc/2
{
  "city": "York"
}
GET fields_test/_mapping
GET fields_test/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

# ignore_malformed: ignore type errors (commonly used in data synchronization)
PUT ignore_malformed
{
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer",
        "ignore_malformed": true
      },
      "number_two": {
        "type": "integer"
      }
    }
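  }
}
# Accepted: number_one is malformed, but the error is not raised and the value is simply not indexed
PUT ignore_malformed/_doc/1
{
  "text":       "Some text value",
  "number_one": "foo"
}
# Rejected: number_two does not ignore malformed values
PUT ignore_malformed/_doc/2
{
  "text":       "Some text value",
  "number_two": "foo"
}
GET ignore_malformed/_search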


#fielddata
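# Number of products per tag; "size": 0 suppresses the raw hits
# (this fails on a text field until fielddata is enabled, see below)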
GET /product/_search
{
  "aggs": {
    "tag_agg_group": {
      "terms": {
        "field": "tags"
      }
    }
  },
  "size":0
}
GET /product/_mapping
# Set fielddata to true on the text field
PUT /product/_mapping
{
  "properties": {
    "tags": {
      "type": "text",
      "fielddata": true
    }
  }
}
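
Since fielddata is heap-hungry, a commonly used alternative is to aggregate on a keyword sub-field instead. The following is a sketch under the assumption that tags is mapped as plain text in the product index, as in the example above; the sub-field name keyword is a convention, not a requirement:

```
# Add a keyword sub-field to the existing text field
PUT /product/_mapping
{
  "properties": {
    "tags": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

# Re-index existing documents in place so they pick up the new sub-field
POST /product/_update_by_query?refresh

# Aggregate on the keyword sub-field; no fielddata needed
GET /product/_search
{
  "size": 0,
  "aggs": {
    "tag_agg_group": {
      "terms": {
        "field": "tags.keyword"
      }
    }
  }
}
```

The buckets then contain whole tag values, whereas fielddata on a text field aggregates over analyzed terms.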

3. Query DSL (Domain Specific Language)

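1 Query context

A search that uses the query keyword is geared towards relevance, so a score has to be computed. Search is the most critical and important part of Elasticsearch.

2 Relevance score: _score

Concept: the relevance score is used to rank search results. The higher the score, the more relevant the result is considered to be to what was searched for. Before 5.0 the score was computed with the TF/IDF algorithm by default; since 5.0 the default is BM25. In this core-knowledge part there is no need to go into how the score is computed; knowing the concept is enough.

Sorting: the relevance score is the basis for ordering search results. By default, the higher the score, the higher the result is ranked.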
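
To see how a score was computed for each hit, the search request accepts an explain flag. A minimal sketch, assuming the product index with a text field desc as in the earlier examples:

```
GET /product/_search
{
  "explain": true,
  "query": {
    "match": { "desc": "zhandouji" }
  }
}
```

Each hit then carries an _explanation section that breaks the _score down into its components.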

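3 Metadata: _source

Disabling _source:

Benefit: lower storage overhead.
Drawbacks:

The update, update_by_query, and reindex APIs are not supported.
Highlighting is not supported.
You lose the ability to reindex from one index to another, whether to change a mapping or analyzer or to upgrade to a new version.
You lose the ability to debug queries or aggregations by looking at the original document that was indexed.
You lose the possibility of index corruption being repaired automatically in a future release.

Summary: if saving disk space is the only goal, raising the index compression level is a better choice than disabling _source.

Source filtering:
Including: which fields are returned in the results.
Excluding: which fields are left out of the results. A field that is not returned can still be searched, because a field missing from the _source metadata is not the same as a field missing from the index.

Defining the filter in the mapping: wildcards are supported, but this approach is not recommended because the mapping cannot be changed afterwards.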

PUT product
{
  "mappings": {
    "_source": {
      "includes": [
        "name",
        "price"
      ],
      "excludes": [
        "desc",
        "tags"
      ]
    }
  }
}

Common filtering rules

"_source": false,
"_source": "obj.*",
"_source": [ "obj1.*", "obj2.*" ],
"_source": {
  "includes": [ "obj1.*", "obj2.*" ],
  "excludes": [ "*.description" ]
}
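
The rules above are fragments of a search body. As a minimal sketch of filtering at query time rather than in the mapping, assuming the product index and its name, price, and desc fields from the earlier examples:

```
GET /product/_search
{
  "_source": {
    "includes": [ "name", "price" ],
    "excludes": [ "desc" ]
  },
  "query": {
    "match_all": {}
  }
}
```

Because the filter is part of the request, it can be changed per search without touching the mapping.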

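4 Query String

Match all documents:

GET /product/_search

With a parameter:

GET /product/_search?q=name:xiaomi

Pagination and sorting:

GET /product/_search?from=0&size=2&sort=price:asc

Exact value match:

GET /product/_search?q=date:2021-06-01

Searching without a field name (the _all style of search) is equivalent to searching every indexed field:

GET /product/_search?q=2021-06-01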
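
The URI searches above can also be written as a request body, which is easier to read once queries grow. A sketch of a possible equivalent of the parameterized and paginated requests, assuming the product index from section 1 with name and price populated:

```
GET /product/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "xiaomi"
    }
  },
  "from": 0,
  "size": 2,
  "sort": [
    { "price": "asc" }
  ]
}
```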

DELETE product
# Verify the _all style of search
PUT product
{
  "mappings": {
    "properties": {
      "desc": {
        "type": "text", 
        "index": false
      }
    }
  }
}
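# Initialize the data first (doc_as_upsert creates document 5 if it does not exist yet)
POST /product/_update/5
{
  "doc": {
    "desc": "erji zhong de kendeji 2021-06-01"
  },
  "doc_as_upsert": true
}
# desc is not indexed, so this is expected to return no hits
GET /product/_search?q=2021-06-01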

5 Full-text queries

GET index/_search
{
  "query": {
    ***
  }
}

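match: matches documents that contain any of the terms in the analyzed search text.

match_all: matches all documents.

multi_match: applies the search text to several fields.

match_phrase: phrase query.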
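
As a sketch of the three non-trivial clauses above, assuming the product index with text fields name and desc as used throughout this article:

```
# match: the text is analyzed; a document matching either term qualifies
GET /product/_search
{
  "query": {
    "match": { "name": "xiaomi nfc" }
  }
}

# multi_match: the same text is searched in several fields
GET /product/_search
{
  "query": {
    "multi_match": {
      "query": "zhandouji",
      "fields": [ "name", "desc" ]
    }
  }
}

# match_phrase: all terms must appear, in this order and adjacent
GET /product/_search
{
  "query": {
    "match_phrase": { "desc": "shouji zhong" }
  }
}
```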

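6 Exact-value queries (term-level queries)

term: matches documents whose field contains exactly the search term.

term vs. match_phrase:
match_phrase analyzes the search text. Every resulting term must be present in the analyzed field, in the same order, and by default contiguously.
term does not analyze the search text.
term vs. keyword:
term is a query that leaves the search text unanalyzed.
keyword is a field type that leaves the field value from the source data unanalyzed.

terms: matches documents whose field contains any of the terms in the list.

range: range search.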
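
A brief sketch of the three term-level queries, assuming a dynamically mapped product index (as in section 1) in which name and tags have keyword sub-fields and price is numeric:

```
# term: the search text is not analyzed, so query the keyword sub-field with the exact value
GET /product/_search
{
  "query": {
    "term": { "name.keyword": "xiaomi phone" }
  }
}

# terms: matches if the field contains any of the listed values
GET /product/_search
{
  "query": {
    "terms": { "tags.keyword": [ "fashao", "buka" ] }
  }
}

# range: prices from 1000 to 2000 inclusive
GET /product/_search
{
  "query": {
    "range": {
      "price": { "gte": 1000, "lte": 2000 }
    }
  }
}
```

Running the term query against the analyzed name field instead would look for the single unanalyzed term "xiaomi phone", which the inverted index of a text field does not contain.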

7 Filter

GET _search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "status": "active"
        }
      }
    }
  }
}

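filter: the main difference between query and filter is that a filter is result-oriented while a query is process-oriented. A query asks "how relevant is this document to the search text?", whereas a filter asks "does this document satisfy the condition or not?". In other words, a query computes a relevance score for every result and a filter does not. Filters also have a caching mechanism, which can improve search performance.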

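8 Compound queries: bool query

bool: combines several query clauses. The bool query takes a more-matches-is-better approach, so the scores from each matching must and should clause are added together to give the final score.

must: the clause (query) must appear in matching documents and contributes to the score.
filter: the clause (query) must appear in matching documents, but unlike must, its score is ignored. Filter clauses are executed in filter context, which means scoring is skipped and the clauses are considered for caching.
should: the clause (query) should appear in matching documents (an "or" condition).
must_not: the clause (query) must not appear in matching documents (a "not" condition). The clause is executed in filter context, which means scoring is skipped and the clause is considered for caching. Because scoring is ignored, a score of 0 is returned for all documents.
minimum_should_match: specifies the number or percentage of should clauses that a returned document must match. If the bool query contains at least one should clause and no must or filter clause, the default is 1. Otherwise, the default is 0.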
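
The clauses above can be combined in one request. The following is a minimal sketch, assuming the dynamically mapped product index from section 1; the tag value gaoduan is hypothetical and only serves to show must_not:

```
GET /product/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "xiaomi" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 3000 } } }
      ],
      "should": [
        { "match": { "desc": "zhandouji" } },
        { "match": { "name": "nfc" } }
      ],
      "must_not": [
        { "term": { "tags.keyword": "gaoduan" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```

Because a must clause is present, minimum_should_match would default to 0 here; setting it to 1 requires at least one of the two should clauses to match as well.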