Elasticsearch for Ruby on Rails: A Chewy Gem Tutorial

Article overview

Why Chewy?
A basic guide to Elasticsearch
Rails integration
Searching
Testing Elasticsearch queries
Conclusion
Appendix: Elasticsearch internals

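Elasticsearch provides a powerful RESTful HTTP interface for indexing and querying data, built on top of the Apache Lucene library. Out of the box it offers scalable, efficient, and robust search with UTF-8 support. It is a strong tool for indexing and querying large amounts of structured data. Here at srcmini, it powers our platform search and will soon be used for autocompletion as well. We're huge fans.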

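Since our platform is built with Ruby on Rails, our Elasticsearch integration takes advantage of the elasticsearch-ruby project (a Ruby integration framework for Elasticsearch that provides a client for connecting to an Elasticsearch cluster, a Ruby API for the Elasticsearch REST API, and various extensions and utilities). Building on this foundation, we developed and released our own improvement (and simplification) of the Elasticsearch application search architecture, packaged as a Ruby gem named Chewy (an example application is also available).

Chewy extends the Elasticsearch-Ruby client, making it more powerful and providing tighter integration with Rails. In this Elasticsearch guide, I discuss (through usage examples) how we accomplished this, including the technical obstacles that emerged during implementation.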

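A couple of quick notes before proceeding to the guide:

Both Chewy and a Chewy demo application are available on GitHub.
For those interested in more "under the hood" information about Elasticsearch, I've included a brief write-up as an appendix to this article.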

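Why Chewy?

Despite Elasticsearch's scalability and efficiency, integrating it with Rails did not turn out to be as simple as anticipated. At srcmini, we found ourselves needing to significantly augment the basic Elasticsearch-Ruby client to make it more performant and to support additional operations.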

And thus, the Chewy gem was born.

A few particularly noteworthy features of Chewy include:

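Every index is observable by all the related models.

Most indexed models are related to each other. Sometimes it is necessary to denormalize this related data and bind it to the same object (for example, if you want to index an array of tags together with their associated article). Chewy lets you specify an updatable index for every model, so the corresponding articles are reindexed whenever a relevant tag is updated.

Index classes are independent from ORM/ODM models.

With this enhancement, implementing cross-model autocompletion, for example, is much easier. You can just define an index and work with it in an object-oriented fashion. Unlike other clients, the Chewy gem removes the need to manually implement index classes, data import callbacks, and other components.

Bulk import is everywhere.

Chewy uses the Elasticsearch bulk API for full reindexing and for index updates. It also uses the concept of atomic updates: changed objects are collected within an atomic block and then updated all at once.

Chewy provides an AR-style query DSL.

By being chainable, mergeable, and lazy, this enhancement allows queries to be produced in a more efficient manner.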
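
To make the bulk import feature more concrete, here is a minimal sketch of the newline-delimited body that the Elasticsearch bulk endpoint accepts: one action line followed by one document line per record. This is not Chewy's internal code. `BookRecord` and `bulk_index_body` are hypothetical names used only for illustration, and Elasticsearch versions from the era of this article also expected a `_type` key in the action line.

```ruby
require 'json'

# Hypothetical record standing in for an ActiveRecord object.
BookRecord = Struct.new(:id, :title, :year)

# Builds a newline-delimited bulk body: an action line followed by a
# document line for every record, terminated by a trailing newline.
def bulk_index_body(index_name, records)
  lines = records.flat_map do |record|
    [
      { index: { _index: index_name, _id: record.id } },
      { title: record.title, year: record.year }
    ]
  end
  lines.map { |line| JSON.generate(line) }.join("\n") + "\n"
end

books = [BookRecord.new(1, 'Dune', 1965), BookRecord.new(2, 'Emma', 1815)]
body = bulk_index_body('entertainment', books)
puts body
```

Two records produce four lines in a single request body, which is why a bulk request is much cheaper than one HTTP request per document.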

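OK, let's see how all of this plays out in the gem...

A basic guide to Elasticsearch

Elasticsearch has several document-related concepts. The first is the index (the analogue of a database in an RDBMS), which consists of a set of documents. Documents can be of several types (a type being roughly analogous to an RDBMS table).

Every document has a set of fields. Each field is analyzed independently, and its analysis options are stored in the mapping for its type. Chewy uses this structure "as is" in its object model: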

class EntertainmentIndex < Chewy::Index
  settings analysis: {
    analyzer: {
      title: {
        tokenizer: 'standard', filter: ['lowercase', 'asciifolding']
      }
    }
  }

  define_type Book.includes(:author, :tags) do
    field :title, analyzer: 'title'
    field :year, type: 'integer'
    field :author, value: ->{ author.name }
    field :author_id, type: 'integer'
    field :description
    field :tags, index: 'not_analyzed', value: ->{ tags.map(&:name) }
  end

  {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
    define_type scope.includes(:director, :tags), name: type_name do
      field :title, analyzer: 'title'
      field :year, type: 'integer'
      field :author, value: ->{ director.name }
      field :author_id, type: 'integer', value: ->{ director_id }
      field :description
      field :tags, index: 'not_analyzed', value: ->{ tags.map(&:name) }
    end
  end
end

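Above, we defined an Elasticsearch index called entertainment with three types: book, movie, and cartoon. For each type we defined some field mappings, plus a hash of settings for the whole index.

So, we've defined EntertainmentIndex and we want to execute some queries. As a first step, we need to create the index and import our data: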

EntertainmentIndex.create!
EntertainmentIndex.import
# EntertainmentIndex.reset! (which includes deletion,
# creation, and import) could be used instead

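The .import method knows which data to import because we passed in scopes when we defined our types. It therefore imports all the books, movies, and cartoons held in persistent storage.

With that done, we can perform some queries: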

EntertainmentIndex.query(match: {author: 'Tarantino'}).filter{ year > 1990 }
EntertainmentIndex.query(match: {title: 'Shawshank'}).types(:movie)
EntertainmentIndex.query(match: {author: 'Tarantino'}).only(:id).limit(10).load
# the last one loads ActiveRecord objects for documents found

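Now our index is almost ready to be used in our search implementation.

Rails integration

To integrate with Rails, the first thing we need is to be able to react to RDBMS object changes. Chewy supports this behavior via callbacks defined with the update_index class method. update_index takes two arguments:

A type identifier supplied in the "index_name#type_name" format
A method name or block to execute, which represents a back-reference to the updated object or object collection

We need to define these callbacks for each dependent model: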

class Book < ActiveRecord::Base
  acts_as_taggable

  belongs_to :author, class_name: 'Dude'
  # We update the book itself on-change
  update_index 'entertainment#book', :self
end

class Video < ActiveRecord::Base
  acts_as_taggable

  belongs_to :director, class_name: 'Dude'
  # Update video types when changed, depending on the category
  update_index('entertainment#movie') { self if movie? }
  update_index('entertainment#cartoon') { self if cartoon? }
end

class Dude < ActiveRecord::Base
  acts_as_taggable

  has_many :books
  has_many :videos
  # If author or director was changed, all the corresponding
  # books, movies and cartoons are updated
  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.movies }
  update_index('entertainment#cartoon') { videos.cartoons }
end

Since tags are also indexed, we next need to monkey-patch some external models so that they react to changes:

ActsAsTaggableOn::Tag.class_eval do
  has_many :books, through: :taggings, source: :taggable, source_type: 'Book'
  has_many :videos, through: :taggings, source: :taggable, source_type: 'Video'

  # Updating all tag-related objects
  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.movies }
  update_index('entertainment#cartoon') { videos.cartoons }
end

ActsAsTaggableOn::Tagging.class_eval do
  # Same goes for the intermediate model
  update_index('entertainment#book') { taggable if taggable_type == 'Book' }
  update_index('entertainment#movie') { taggable if taggable_type == 'Video' &&
                                        taggable.movie? }
  update_index('entertainment#cartoon') { taggable if taggable_type == 'Video' &&
                                          taggable.cartoon? }
end

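At this point, every object save or destroy updates the corresponding Elasticsearch index type.

Atomicity

We still have one lingering problem. If we do something like books.map(&:save) to save multiple books, we request an update of the entertainment index every time an individual book is saved. So if we save five books, we update the Chewy index five times. This behavior is acceptable in a REPL, but certainly not in controller actions where performance is critical.

We address this issue with the Chewy.atomic block: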

class ApplicationController < ActionController::Base
  around_action { |&block| Chewy.atomic(&block) }
end

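In a nutshell, Chewy.atomic batches these updates as follows:

Disables the after_save callback.
Collects the IDs of saved books.
On completion of the Chewy.atomic block, uses the collected IDs to make a single Elasticsearch index update request.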
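
The batching steps above can be sketched in plain Ruby. This is only an illustration of the idea, not Chewy's implementation: `BatchedIndexer`, its `update` method, and the `requests` log are hypothetical stand-ins, and no Elasticsearch call is made.

```ruby
# Hypothetical stand-in for the batching behavior; not the gem's code.
module BatchedIndexer
  @pending = nil   # nil means no atomic block is currently active
  @requests = []   # log of simulated index update requests

  class << self
    attr_reader :requests

    # Would be called from a model's save callback.
    def update(type, id)
      if @pending
        @pending[type] << id        # inside a block: only remember the ID
      else
        @requests << [type, [id]]   # outside a block: one request per save
      end
    end

    # Collects IDs while the block runs, then flushes once per type.
    def atomic
      return yield if @pending      # nested blocks join the outer batch

      begin
        @pending = Hash.new { |hash, key| hash[key] = [] }
        result = yield
        @pending.each { |type, ids| @requests << [type, ids.uniq] }
        result
      ensure
        @pending = nil
      end
    end
  end
end

BatchedIndexer.atomic do
  [1, 2, 3, 2].each { |id| BatchedIndexer.update('entertainment#book', id) }
end
p BatchedIndexer.requests
```

Four saves inside the block result in a single logged request carrying the three distinct IDs, whereas the same saves outside the block would log four requests.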

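Searching

Now we're ready to implement a search interface. Since our user interface is a form, the best way to build it is, of course, with FormBuilder and ActiveModel. (At srcmini, we use ActiveData to implement ActiveModel interfaces, but feel free to use your favorite gem.)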

class EntertainmentSearch
  include ActiveData::Model

  attribute :query, type: String
  attribute :author_id, type: Integer
  attribute :min_year, type: Integer
  attribute :max_year, type: Integer
  attribute :tags, mode: :arrayed, type: String, normalize: ->(value) { value.reject(&:blank?) }

  # This accessor is for the form. It will have a single text field
  # for comma-separated tag inputs.
  def tag_list= value
    self.tags = value.split(',').map(&:strip)
  end

  def tag_list
    self.tags.join(', ')
  end
end

Query and filters tutorial

Now that we have an ActiveModel-like object that can accept and typecast attributes, let's implement search:

class EntertainmentSearch
  ...

  def index
    EntertainmentIndex
  end

  def search
    # We can merge multiple scopes
    [query_string, author_id_filter, year_filter, tags_filter].compact.reduce(:merge)
  end

  # Using query_string advanced query for the main query input
  def query_string
    index.query(query_string: {fields: [:title, :author, :description], query: query, default_operator: 'and'}) if query?
  end

  # Simple term filter for author id. `:author_id` is already
  # typecasted to integer and ignored if empty.
  def author_id_filter
    index.filter(term: {author_id: author_id}) if author_id?
  end

  # For filtering on years, we will use range filter.
  # Returns nil if both min_year and max_year are not passed to the model.
  def year_filter
    body = {}.tap do |body|
      body.merge!(gte: min_year) if min_year?
      body.merge!(lte: max_year) if max_year?
    end
    index.filter(range: {year: body}) if body.present?
  end

  # Same goes for `author_id_filter`, but `terms` filter used.
  # Returns nil if no tags passed in.
  def tags_filter
    index.filter(terms: {tags: tags}) if tags?
  end
end
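
The `compact.reduce(:merge)` idiom used in `search` can be tried without Elasticsearch. The sketch below uses a hypothetical `SearchScope` class that only mimics the chainable, mergeable, lazy behavior described earlier. It is not Chewy's query class, and the request hash it builds is a simplified illustration rather than a valid Elasticsearch body.

```ruby
# Hypothetical miniature of a chainable, mergeable, lazy search scope.
class SearchScope
  attr_reader :queries, :filters

  def initialize(queries: [], filters: [])
    @queries = queries
    @filters = filters
  end

  # Each call returns a new scope, so calls can be chained.
  def query(clause)
    SearchScope.new(queries: queries + [clause], filters: filters)
  end

  def filter(clause)
    SearchScope.new(queries: queries, filters: filters + [clause])
  end

  # Combines the criteria of two scopes into one.
  def merge(other)
    SearchScope.new(queries: queries + other.queries,
                    filters: filters + other.filters)
  end

  # Nothing is sent anywhere; the body is built only on demand.
  def to_request
    { query: queries, filter: filters }
  end
end

def build_search(query: nil, min_year: nil, max_year: nil)
  base = SearchScope.new
  range = {}
  range[:gte] = min_year if min_year
  range[:lte] = max_year if max_year

  scopes = [
    (base.query(query_string: { query: query }) if query),
    (base.filter(range: { year: range }) unless range.empty?)
  ]
  # Missing criteria are nil; compact drops them before merging.
  scopes.compact.reduce(:merge)
end

request = build_search(query: 'Tarantino', min_year: 1990).to_request
p request
```

One detail worth knowing: when every criterion is absent, `[].reduce(:merge)` returns `nil`, so a caller of this pattern may want a fallback scope for an empty form.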

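Controllers and views

At this point, our model can perform search requests with the attributes passed to it. Usage looks something like this: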

EntertainmentSearch.new(query: 'Tarantino', min_year: 1990).search

Note that in the controller, we want to load actual ActiveRecord objects instead of Chewy document wrappers:

class EntertainmentController < ApplicationController
  def index
    @search = EntertainmentSearch.new(params[:search])
    # In case we want to load real objects, we don't need any other
    # fields except for `:id` retrieved from Elasticsearch index.
    # Chewy query DSL supports Kaminari gem and corresponding API.
    # Also, we pass scopes for every requested type to the `load` method.
    @entertainments = @search.search.only(:id).page(params[:page]).load(
      book: {scope: Book.includes(:author)}, movie: {scope: Video.includes(:director)}, cartoon: {scope: Video.includes(:director)}
    )
  end
end

Now it's time to write some HAML in entertainment/index.html.haml:

= form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
  = f.text_field :query
  = f.select :author_id, Dude.all.map { |d| [d.name, d.id] }, include_blank: true
  = f.text_field :min_year
  = f.text_field :max_year
  = f.text_field :tag_list
  = f.submit

- if @entertainments.any?
  %dl
    - @entertainments.each do |entertainment|
      %dt
        %h1= entertainment.title
        %strong= entertainment.class
      %dd
        %p= entertainment.year
        %p= entertainment.description
        %p= entertainment.tag_list
    = paginate @entertainments
- else
  Nothing to see here

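Sorting

As a bonus, we'll also add sorting to our search functionality.

Assume that we need to sort on the title and year fields, as well as by relevance. Unfortunately, the title "One Flew Over the Cuckoo's Nest" will be split into individual terms, so sorting by these disparate terms would be too arbitrary. Instead, we'd like to sort by the entire title.

The solution is to use a special title field and apply its own analyzer: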

class EntertainmentIndex < Chewy::Index
  settings analysis: {
    analyzer: {
      ...
      sorted: {
        # `keyword` tokenizer will not split our titles and
        # will produce the whole phrase as the term, which
        # can be sorted easily
        tokenizer: 'keyword', filter: ['lowercase', 'asciifolding']
      }
    }
  }

  define_type Book.includes(:author, :tags) do
    # We use the `multi_field` type to add `title.sorted` field
    # to the type mapping. Also, will still use just the `title`
    # field for search.
    field :title, type: 'multi_field' do
      field :title, index: 'analyzed', analyzer: 'title'
      field :sorted, index: 'analyzed', analyzer: 'sorted'
    end
    ...
  end

  {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
    define_type scope.includes(:director, :tags), name: type_name do
      # For videos as well
      field :title, type: 'multi_field' do
        field :title, index: 'analyzed', analyzer: 'title'
        field :sorted, index: 'analyzed', analyzer: 'sorted'
      end
      ...
    end
  end
end
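
To see why the `keyword` tokenizer matters for sorting, here is a rough plain-Ruby approximation of the two analyzers. `standard_terms` and `keyword_terms` are hypothetical helpers written for this illustration. Real Elasticsearch analysis also folds non-ASCII characters and treats punctuation differently, so this is a sketch of the idea only.

```ruby
# Rough stand-in for the `title` analyzer: lowercase, then split into words.
def standard_terms(text)
  text.downcase.scan(/[[:alnum:]']+/)
end

# Rough stand-in for the `sorted` analyzer: the whole title is one term.
def keyword_terms(text)
  [text.downcase]
end

titles = ["One Flew Over the Cuckoo's Nest", 'Pulp Fiction', 'A Clockwork Orange']

# A tokenized title exposes several unrelated terms, so there is no
# single obvious value to sort on.
p standard_terms(titles.first)

# A keyword-analyzed title is a single term, which sorts predictably.
sorted = titles.sort_by { |title| keyword_terms(title).first }
p sorted
```

The first title yields six separate terms under the word-splitting approach, while the keyword approach gives exactly one sortable value per title.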

In addition, we'll add both the new attribute and the sort processing step to our search model:

class EntertainmentSearch
  # we are going to use `title.sorted` field for sort
  SORT = {title: {'title.sorted' => :asc}, year: {year: :desc}, relevance: :_score}
  ...
  attribute :sort, type: String, enum: %w(title year relevance), default_blank: 'relevance'
  ...
  def search
    # we have added `sorting` scope to merge list
    [query_string, author_id_filter, year_filter, tags_filter, sorting].compact.reduce(:merge)
  end

  def sorting
    # We have one of the 3 possible values in `sort` attribute
    # and `SORT` mapping returns actual sorting expression
    index.order(SORT[sort.to_sym])
  end
end

Finally, we'll modify our form, adding a select box for the sort options:

= form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
  ...
  / `EntertainmentSearch.sort_values` will just return
  / enum option content from the sort attribute definition.
  = f.select :sort, EntertainmentSearch.sort_values
  ...

Error handling

If your users submit malformed queries such as ( or AND, the Elasticsearch client raises an error. To handle this, let's make some changes to our controller:

class EntertainmentController < ApplicationController
  def index
    @search = EntertainmentSearch.new(params[:search])
    @entertainments = @search.search.only(:id).page(params[:page]).load(
      book: {scope: Book.includes(:author)}, movie: {scope: Video.includes(:director)}, cartoon: {scope: Video.includes(:director)}
    )
  rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
    @entertainments = []
    @error = e.message.match(/QueryParsingException\[([^;]+)\]/).try(:[], 1)
  end
end
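
The regular expression in the rescue clause can be tried in isolation. In the sketch below the error text is made up for illustration (the real message depends on the Elasticsearch version), and `parse_error_reason` is a hypothetical helper. It uses plain Ruby instead of ActiveSupport's `try` so that it runs outside Rails.

```ruby
# Hypothetical error text, shaped like a nested query parsing failure.
message = 'SearchPhaseExecutionException[Failed to execute phase [query]]; ' \
          'nested: QueryParsingException[[entertainment] Failed to parse query [(]]; }'

PARSE_ERROR = /QueryParsingException\[([^;]+)\]/

# Returns the text inside QueryParsingException[...], or nil if absent.
def parse_error_reason(message)
  match = message.match(PARSE_ERROR)
  match && match[1]
end

p parse_error_reason(message)
p parse_error_reason('connection timed out')
```

Because the capture group stops at the first semicolon and then backs up to the last closing bracket before it, only the parsing failure itself is shown to the user, and unrelated errors yield nil.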

Further, we need to render the error in the view:

...
- if @entertainments.any?
  ...
- else
  - if @error
    = @error
  - else
    Nothing to see here

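Testing Elasticsearch queries

The basic testing setup is as follows:

Start the Elasticsearch server.
Clean up and create our indices.
Import our data.
Perform our query.
Cross-reference the result with our expectations.

For step 1, it's convenient to use the test cluster defined in the elasticsearch-extensions gem. After installing the gem, just add the following line to your project's Rakefile: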

require 'elasticsearch/extensions/test/cluster/tasks'

You'll then get the following Rake tasks:

$ rake -T elasticsearch
rake elasticsearch:start  # Start Elasticsearch cluster for tests
rake elasticsearch:stop   # Stop Elasticsearch cluster for tests

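Elasticsearch and RSpec

First, we need to make sure that our index is updated so that it stays in sync with our data changes. Luckily, the Chewy gem comes with the helpful update_index RSpec matcher: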

describe EntertainmentIndex do
  # No need to cleanup Elasticsearch as requests are
  # stubbed in case of `update_index` matcher usage.
  describe 'Tag' do
    # We create several books with the same tag
    let(:books) { create_list :book, 2, tag_list: 'tag1' }

    specify do
      # We expect that after modifying the tag name...
      expect do
        ActsAsTaggableOn::Tag.where(name: 'tag1').first.update_attributes(name: 'tag2')
      # ... the corresponding type will be updated with previously-created books.
      end.to update_index('entertainment#book').and_reindex(books, with: {tags: ['tag2']})
    end
  end
end

Next, we need to test that the actual search queries are performed properly and that they return the expected results:

describe EntertainmentSearch do
  # Just defining helpers for simplifying testing
  def search attributes = {}
    EntertainmentSearch.new(attributes).search
  end

  # Import helper as well
  def import *args
    # We are using `import!` here to be sure all the objects are imported
    # correctly before examples run.
    EntertainmentIndex.import! *args
  end

  # Deletes and recreates index before every example
  before { EntertainmentIndex.purge! }

  describe '#min_year, #max_year' do
    let(:book) { create(:book, year: 1925) }
    let(:movie) { create(:movie, year: 1970) }
    let(:cartoon) { create(:cartoon, year: 1995) }
    before { import book: book, movie: movie, cartoon: cartoon }

    # NOTE:  The sample code below provides a clear usage example but is not
    # optimized code.  Something along the following lines would perform better:
    # `specify { search(min_year: 1970).map(&:id).map(&:to_i)
    #                                  .should =~ [movie, cartoon].map(&:id) }`
    specify { search(min_year: 1970).load.should =~ [movie, cartoon] }
    specify { search(max_year: 1980).load.should =~ [book, movie] }
    specify { search(min_year: 1970, max_year: 1980).load.should == [movie] }
    specify { search(min_year: 1980, max_year: 1970).should == [] }
  end
end

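Test cluster troubleshooting

Finally, here is a guide for troubleshooting your test cluster:

To start, use an in-memory, one-node cluster. Specs will run much faster. In our case: TEST_CLUSTER_NODES=1 rake elasticsearch:start
The elasticsearch-extensions test cluster implementation itself has some issues related to the one-node cluster status check (the status is yellow in some cases and never turns green, so the green-status cluster start check fails every time). The issue has been fixed in a fork, and hopefully it will be fixed in the main repository soon.
For each dataset, group your requests within specs (i.e., import your data once and then perform several requests). Elasticsearch takes a long time to warm up and uses a lot of heap memory while importing data, so don't overdo it, especially if you have a lot of specs.
Make sure your machine has sufficient memory, or Elasticsearch will freeze (we needed around 5GB for each testing virtual machine and around 1GB for Elasticsearch itself).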

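Conclusion

Elasticsearch describes itself as "a flexible and powerful open source, distributed, real-time search and analytics engine." It is the gold standard in search technologies.

With Chewy, our Rails developers have packaged these benefits as a simple, easy-to-use, production-quality, open source Ruby gem that provides tight integration with Rails. Elasticsearch and Rails: what an awesome combination!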

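Appendix: Elasticsearch internals

Here's a very brief introduction to Elasticsearch "under the hood"...

Elasticsearch is built on Lucene, which itself uses inverted indices as its primary data structure. For example, if we have the strings "the dogs jump high", "jump over the fence", and "the fence was too high", we get the following structure: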

"the"       [0, 0], [1, 2], [2, 0]
"dogs"      [0, 1]
"jump"      [0, 2], [1, 0]
"high"      [0, 3], [2, 4]
"over"      [1, 1]
"fence"     [1, 3], [2, 1]
"was"       [2, 2]
"too"       [2, 3]

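Thus, every term contains both references to the texts and positions within them. Furthermore, we choose to modify our terms (e.g., by removing stop words like "the") and apply phonetic hashing to every term (can you guess the algorithm?):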

"DAG"       [0, 1]
"JANP"      [0, 2], [1, 0]
"HAG"       [0, 3], [2, 4]
"OVAR"      [1, 1]
"FANC"      [1, 3], [2, 1]
"W"         [2, 2]
"T"         [2, 3]

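If we then query for "the dog jumps", it is analyzed in the same way as the source text, becoming "DAG JANP" after hashing ("dog" has the same hash as "dogs", and the same is true of "jumps" and "jump").

We also add some logic between the individual words in the string (based on configuration settings), choosing between ("DAG" AND "JANP") or ("DAG" OR "JANP"). The former returns the intersection of [0] & [0, 1] (i.e., document 0), and the latter the union [0] | [0, 1] (i.e., documents 0 and 1). The in-text positions can be used for scoring results and for position-dependent queries.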
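
The first table and the AND/OR lookup can be reproduced with a short plain-Ruby sketch. It deliberately skips stop-word removal and phonetic hashing, so it queries the unmodified terms "dogs" and "jump". The method names are hypothetical, and the code illustrates the data structure rather than how Lucene implements it.

```ruby
DOCUMENTS = [
  'the dogs jump high',
  'jump over the fence',
  'the fence was too high'
].freeze

# Maps each term to a list of [document_id, position] pairs.
def build_inverted_index(documents)
  index = Hash.new { |hash, term| hash[term] = [] }
  documents.each_with_index do |text, doc_id|
    text.split.each_with_index do |term, position|
      index[term] << [doc_id, position]
    end
  end
  index
end

# Returns the sorted IDs of the documents that contain the term.
def documents_for(index, term)
  index.fetch(term, []).map(&:first).uniq.sort
end

# AND: documents containing every term.
def search_all(index, terms)
  terms.map { |term| documents_for(index, term) }.reduce(:&) || []
end

# OR: documents containing at least one term.
def search_any(index, terms)
  terms.map { |term| documents_for(index, term) }.reduce(:|) || []
end

index = build_inverted_index(DOCUMENTS)
p index['the']
p search_all(index, %w[dogs jump])
p search_any(index, %w[dogs jump])
```

The stored positions are not used by these two lookups, but they are what a phrase or proximity query would consult.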
