Spring Boot + Elasticsearch

Building a blog search system with Spring Boot + Elasticsearch

Resources:

Project tools
- Spring Boot
- Elasticsearch
- Kibana
- Postman
- Vue

Elasticsearch

Distributed
Full-text search
Fast, near real-time
RESTful

MySQL	ES
Database	Index
Table	Type
Row	Document
Column	Field
Schema	Mapping

MySQL: 
select * from user.user_info where name = "张三"

ES: 
GET /user/user_info/_search?q=name:张三
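To make the mapping concrete, the URI search can be assembled from the same pieces the SQL statement uses (database, table, column, value). This is only a minimal sketch: `uri_search_path` is a hypothetical helper written for these notes, not part of any Elasticsearch client.

```python
from urllib.parse import quote, unquote


def uri_search_path(index, doc_type, field, value):
    """Build an ES 6.x URI search path: /<index>/<type>/_search?q=<field>:<value>."""
    # Percent-encode the query so non-ASCII values such as 张三 are URL-safe.
    query = quote(f"{field}:{value}", safe=":")
    return f"/{index}/{doc_type}/_search?q={query}"


# database "user" -> index, table "user_info" -> type, column "name" -> field
path = uri_search_path("user", "user_info", "name", "张三")
print(path)           # percent-encoded form sent over HTTP
print(unquote(path))  # decodes back to the readable form
```

Kibana's console does the encoding for you, which is why the readable form can be typed there directly.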

Download and install

Versions
- ElasticSearch 6.3.2
- Kibana 6.3.2

Mirrors in China
- ElasticSearch 6.3.2
- Kibana 6.3.2

Start

# Start elasticsearch
cd elasticsearch-6.3.2
bash ./bin/elasticsearch

# Start kibana
cd kibana-6.3.2-darwin-x86_64
bash ./bin/kibana

Check:

elasticsearch: http://127.0.0.1:9200/
kibana: http://localhost:5601/

Interactive operations

# List all indices
GET /_all

# Create an index
PUT /person

# Add a document
PUT /person/_doc/1
{
  "name": "Tom",
  "pets": ["pig", "cat"]
}

# Add a document
PUT /person/_doc/2
{
  "name": "Jack",
  "pets": ["dog", "cat"]
}

# Get a document
GET /person/_doc/1

# Search
GET /person/_doc/_search?q=name:Tom

# Query DSL search; _doc can be omitted from the path
POST /person/_search
{
  "query": {
    "bool": {
      "should": {
        "match": {
          "name": "Tom"
        }
      }
    }
  }
}

# OR query
POST /person/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "Tom"
          }
        },
        {
          "match": {
            "name": "Jack"
          }
        }
      ]
    }
  }
}


# AND query
POST /person/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "Tom"
          }
        },
        {
          "match": {
            "name": "Jack"
          }
        }
      ]
    }
  }
}
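The OR and AND requests differ only in the clause name (`should` vs `must`), so both bodies can be generated by one function. The sketch below also includes a toy in-memory evaluator to show what each clause returns for the two sample documents. `bool_match` and `toy_search` are hypothetical names, and the evaluator is a simplification (exact word match, no scoring), not how Elasticsearch executes queries.

```python
def bool_match(field, values, operator="or"):
    """Build a bool query body: 'or' -> should clauses, 'and' -> must clauses."""
    clause = "should" if operator == "or" else "must"
    return {"query": {"bool": {clause: [{"match": {field: v}} for v in values]}}}


def toy_search(docs, query):
    """Toy evaluator: a match clause hits when the term equals one word of the field."""
    clause, matches = next(iter(query["query"]["bool"].items()))
    combine = any if clause == "should" else all
    hits = []
    for doc in docs:
        results = []
        for m in matches:
            field, term = next(iter(m["match"].items()))
            results.append(term.lower() in str(doc[field]).lower().split())
        if combine(results):
            hits.append(doc["name"])
    return hits


docs = [
    {"name": "Tom", "pets": ["pig", "cat"]},
    {"name": "Jack", "pets": ["dog", "cat"]},
]
or_hits = toy_search(docs, bool_match("name", ["Tom", "Jack"], "or"))
and_hits = toy_search(docs, bool_match("name", ["Tom", "Jack"], "and"))
print(or_hits)   # both documents
print(and_hits)  # empty: no single name contains both words
```

With the sample data the AND query returns nothing, because no document has a name containing both "Tom" and "Jack".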

# Delete the index
DELETE /person

MySQL-based implementation

create table blog(
  id int(11) not null primary key auto_increment,
  title varchar(60) default null,
  content text,
  create_time datetime default null,
  update_time datetime default null
)

select * from blog where title like "%spring%" or content like "%spring%";
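The LIKE-based search can be tried without a MySQL server. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL (an assumption made only so the example is self-contained), with a simplified `blog` table and made-up sample rows. `search_blog` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table blog ("
    " id integer primary key autoincrement,"
    " title varchar(60), content text,"
    " create_time datetime, update_time datetime)"
)
# Sample rows invented for the demonstration.
conn.executemany(
    "insert into blog (title, content) values (?, ?)",
    [
        ("Spring Boot notes", "getting started"),
        ("ES basics", "integrating with spring"),
        ("MySQL tips", "indexes and joins"),
    ],
)


def search_blog(conn, keyword):
    """Return titles of rows whose title or content contains the keyword."""
    pattern = f"%{keyword}%"  # bound as a parameter, never concatenated into the SQL
    rows = conn.execute(
        "select title from blog where title like ? or content like ? order by id",
        (pattern, pattern),
    )
    return [title for (title,) in rows]


titles = search_blog(conn, "spring")
print(titles)
conn.close()
```

A pattern with a leading `%` cannot use an ordinary B-tree index, so this style of search scans every row. That cost is the motivation for moving the search to ES.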

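ES-based implementation

MySQL -> ES data synchronization

Full sync
Incremental sync

Open-source middleware

binlog subscription:
- alibaba/canal
- siddontang/go-mysql-elasticsearch (in development)
- logstash (polls on an id/time column rather than reading the binlog)

Full and incremental sync with logstash

Download logstash 6.3.2 from a mirror in China

Download the MySQL driver mysql-connector-java.jar

Sync example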

create table user(
  id int(11) not null primary key auto_increment,
  name varchar(60) default null,
  age int(11),
  create_time datetime default CURRENT_TIMESTAMP,
  update_time datetime default CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)

Sync configuration: mysql.conf

input {
  jdbc {
    # Path to the JDBC driver jar
    jdbc_driver_library => "./mysql-connector-java-8.0.16.jar"
    # Driver class
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # Connection string; with the 8.0+ driver, serverTimezone=UTC must be included
    jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/data?characterEncoding=utf8&useSSL=false&serverTimezone=UTC&rewriteBatchedStatements=true"
    # User
    jdbc_user => "root"
    # Password
    jdbc_password => "123456"
    # Cron-style schedule; with six fields the first one is seconds, so this runs every second
    # (use "* * * * *" to run once a minute)
    schedule => "* * * * * *"
    # For a full sync, uncomment to discard the stored sql_last_value
    # clean_run => true
    # Statement to execute
    statement => "SELECT * FROM user WHERE update_time >= :sql_last_value"
    # Paging
    jdbc_paging_enabled => "true"
    jdbc_page_size => "5000"
    # Track the value of a column instead of the time of the last run
    use_column_value => true
    # Type of the tracking column
    tracking_column_type => "timestamp"
    # Name of the tracking column
    tracking_column => "update_time"
    # File that stores the sync point
    last_run_metadata_path => "user_syncpoint.txt"
  }
}
output {
    elasticsearch {
        # ES host and port
        hosts => ["http://127.0.0.1:9200"]
        # Index name (can be customized)
        index => "user"
        # Document id, taken from the id column of the source table
        document_id => "%{id}"
    }
    stdout {
        # Print each event as a JSON line
        codec => json_lines
    }
}
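The incremental behaviour of this configuration comes from three pieces: the `update_time >= :sql_last_value` filter, the stored sync point, and `document_id => "%{id}"`. The sketch below simulates them in memory to show how they interact. It is an illustration under simplifying assumptions (rows are dicts, the index is a dict keyed by id, the sync point advances to the largest `update_time` seen), not Logstash's actual implementation. `fetch_changed` and `sync_once` are hypothetical names.

```python
from datetime import datetime


def fetch_changed(rows, sql_last_value):
    """Mimic: SELECT * FROM user WHERE update_time >= :sql_last_value"""
    return [r for r in rows if r["update_time"] >= sql_last_value]


def sync_once(rows, index, sql_last_value):
    """One scheduled run: upsert changed rows by id, then advance the sync point."""
    changed = fetch_changed(rows, sql_last_value)
    for row in changed:
        index[row["id"]] = dict(row)  # same id overwrites, like document_id => "%{id}"
    if changed:
        # Simplification: take the largest tracking-column value seen in this run.
        sql_last_value = max(r["update_time"] for r in changed)
    return sql_last_value, len(changed)


table = [
    {"id": 1, "name": "Tom", "update_time": datetime(2019, 1, 1, 10, 0)},
    {"id": 2, "name": "Jack", "update_time": datetime(2019, 1, 1, 10, 5)},
]
es_index = {}
last = datetime(1970, 1, 1)  # starting point for a timestamp tracking column

last, n1 = sync_once(table, es_index, last)  # first run behaves like a full sync
table[0].update(name="Tommy", update_time=datetime(2019, 1, 1, 10, 9))
last, n2 = sync_once(table, es_index, last)  # changed row plus the boundary row (>=)
print(n1, n2, len(es_index), es_index[1]["name"])
```

Because the statement uses `>=`, the row sitting exactly on the sync point is selected again on the next run. With the document id tied to the table's `id`, that only overwrites the same document instead of creating a duplicate.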

Start the sync

$ ./bin/logstash -f ./config/mysql.conf

Configure pipelines.yml

- pipeline.id: table-user
  path.config: "./config/mysql.conf"

Start the sync (logstash reads pipelines.yml when no -f is given)

$ ./bin/logstash

Insert test data into the user table

# -*- coding: utf-8 -*-

from puremysql import PureMysql
from faker import Faker
import random

con = PureMysql(db_url="mysql://root:123456@127.0.0.1:3306/data?charset=utf8")
user_table = con.table("user")

# Generate mock data: 100 batches * 5000 rows = 500,000 rows
faker = Faker(locale="zh_CN")
for i in range(0, 100):
    lst = []
    for j in range(0, 5000):
        lst.append({
            "name": faker.name(),
            "age": random.randint(1, 100)
        })

    count = user_table.insert(lst)
    print(count)

con.close()

Jdbc input plugin configuration options

Setting	Input type	Required	Default
clean_run	boolean	No	false
columns_charset	hash	No	{}
connection_retry_attempts	number	No	1
connection_retry_attempts_wait_time	number	No	0.5
jdbc_connection_string	string	Yes	-
jdbc_default_timezone	string	No	-
jdbc_driver_class	string	Yes	-
jdbc_driver_library	string	No	-
jdbc_fetch_size	number	No	-
jdbc_page_size	number	No	100000
jdbc_paging_enabled	boolean	No	false
jdbc_password	password	No	-
jdbc_password_filepath	a valid filesystem path	No	-
jdbc_pool_timeout	number	No	5
jdbc_user	string	Yes	-
jdbc_validate_connection	boolean	No	false
jdbc_validation_timeout	number	No	3600
last_run_metadata_path	string	No	"$HOME/.logstash_jdbc_last_run"
lowercase_column_names	boolean	No	true
parameters	hash	No	{}
plugin_timezone	string, one of ["local", "utc"]	No	"utc"
prepared_statement_bind_values	array	No	[]
prepared_statement_name	string	No	""
record_last_run	boolean	No	true
schedule	string	No	-
sequel_opts	hash	No	{}
sql_log_level	string, one of ["fatal", "error", "warn", "info", "debug"]	No	"info"
statement	string	No	-
statement_filepath	a valid filesystem path	No	-
tracking_column	string	No	-
tracking_column_type	string, one of ["numeric", "timestamp"]	No	"numeric"
use_column_value	boolean	No	false
use_prepared_statements	boolean	No	false

Configuration reference: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html

Analyzers

standard: splits Chinese text into single characters
simple
whitespace: no Chinese support
language: no Chinese support

POST _analyze
{
  "analyzer": "standard",
  "text": "hello world"
}

# hello world

POST _analyze
{
  "analyzer": "standard",
  "text": "中国人"
}
# 中 国 人
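When the same check is scripted rather than typed into Kibana, the request body and the interesting part of the response are plain dictionaries. The sketch below builds the body and pulls the token strings out of a response. `analyze_request` and `token_strings` are hypothetical helpers, and `sample_response` is a hand-written illustration of the `_analyze` response shape for the request above, not captured output.

```python
def analyze_request(analyzer, text):
    """Body for POST _analyze, matching the console examples."""
    return {"analyzer": analyzer, "text": text}


def token_strings(response):
    """Extract the token text, ordered by position, from an _analyze response."""
    ordered = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in ordered]


# Hand-written illustration of the response shape (analyzer=standard, text=中国人).
sample_response = {
    "tokens": [
        {"token": "中", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
        {"token": "国", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
        {"token": "人", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
    ]
}

body = analyze_request("standard", "中国人")
tokens = token_strings(sample_response)
print(body)
print(" ".join(tokens))
```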

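IK analyzer

elasticsearch-analysis-ik

Download and unzip it into the ES plugins folder, then restart ES for it to take effect

Text to analyze: 我是中国人 ("I am Chinese")

ik_smart: 我/是/中国人
ik_max_word: 我/是/中国人/中国/国人

Custom dictionary

Add custom words to the dictionary file

elasticsearch-analysis-ik-6.3.2/config/main.dic

Analyze again (here with 我是 added as a custom word)

ik_smart: 我是/中国人
ik_max_word: 我是/中国人/中国/国人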
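Why one extra dictionary entry changes the output can be shown with a toy dictionary-based segmenter. The sketch below uses forward maximum matching, a deliberately simple stand-in: IK's real algorithm is more involved, and this version only happens to reproduce the ik_smart results quoted above for this one sentence. The word list is a tiny hypothetical dictionary, not the contents of main.dic.

```python
def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


base_dict = {"我", "是", "中国", "国人", "中国人"}  # hypothetical mini dictionary
before = forward_max_match("我是中国人", base_dict)
after = forward_max_match("我是中国人", base_dict | {"我是"})  # custom word added
print("/".join(before))
print("/".join(after))
```

Once 我是 is a known word, the segmenter prefers it over the two single characters, which mirrors the change from 我/是/中国人 to 我是/中国人.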

Integrating ES with Spring Boot

POST blog/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "title": "杏花"
          }
        },
        {
          "match_phrase": {
            "content": "杏花"
          }
        }
      ]
    }
  }
}
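The blog search sends this same body for whatever keyword the user types, so only the keyword changes between requests. A minimal sketch of generating it is shown below in Python, for consistency with the test-data script; a Spring Boot service would build the equivalent structure. `phrase_search_body` is a hypothetical helper, and the field names default to the `title` and `content` columns of the blog table.

```python
import json


def phrase_search_body(keyword, fields=("title", "content")):
    """Build a bool/should query with one match_phrase clause per field."""
    should = [{"match_phrase": {field: keyword}} for field in fields]
    return {"query": {"bool": {"should": should}}}


body = phrase_search_body("杏花")
# ensure_ascii=False keeps the Chinese keyword readable in the printed JSON.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Using `match_phrase` instead of `match` requires the analyzed terms to appear adjacent and in order, so a single-character analyzer does not return documents that merely contain 杏 and 花 somewhere apart.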