Django-haystack global search

Overview

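A haystack is the collection being searched and a needle is the item being looked for, so "finding a needle in a haystack" is a fitting image for global search. This post outlines how to use django-haystack + Whoosh + jieba to implement combined search across multiple models, with support for fuzzy search, per-model search, and highlighting of the search keywords in the results. It also covers using django-haystack + django-hvad for global search in a multilingual environment.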

1. Install the required packages

django-haystack==2.6.1
Whoosh==2.7.4
jieba==0.39

2. Update settings.py

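import os

# Search service
HAYSTACK_CONNECTIONS = {
    'default': {
        # Custom backend that plugs jieba into Whoosh (defined in step 3).
        'ENGINE': 'common.whoosh_cn_backend.ChineseWhooshEngine',
        # Directory where Whoosh stores its index files.
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}
# Update the search index automatically whenever a model is saved or deleted
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'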

3. Add jieba word segmentation to the search engine

Create the file common/whoosh_cn_backend.py (if you change the folder or file name, update the HAYSTACK_CONNECTIONS -> ENGINE path in settings.py accordingly) with the following content:

from haystack.constants import DJANGO_CT, DJANGO_ID, ID
from haystack.exceptions import SearchBackendError

from whoosh.fields import ID as WHOOSH_ID
from whoosh.fields import BOOLEAN, DATETIME, IDLIST, KEYWORD, NGRAM, NGRAMWORDS, NUMERIC, Schema, TEXT

from jieba.analyse import ChineseAnalyzer
from haystack.backends.whoosh_backend import WhooshSearchBackend, WhooshEngine


class ChineseWhooshSearchBackend(WhooshSearchBackend):

    def build_schema(self, fields):
        schema_fields = {
            ID: WHOOSH_ID(stored=True, unique=True),
            DJANGO_CT: WHOOSH_ID(stored=True),
            DJANGO_ID: WHOOSH_ID(stored=True),
        }
        # Grab the number of keys that are hard-coded into Haystack.
        # We'll use this to (possibly) fail slightly more gracefully later.
        initial_key_count = len(schema_fields)
        content_field_name = ''

        for field_name, field_class in fields.items():
            if field_class.is_multivalued:
                if field_class.indexed is False:
                    schema_fields[field_class.index_fieldname] = IDLIST(stored=True, field_boost=field_class.boost)
                else:
                    schema_fields[field_class.index_fieldname] = KEYWORD(stored=True, commas=True, scorable=True,
                                                                         field_boost=field_class.boost)
            elif field_class.field_type in ['date', 'datetime']:
                schema_fields[field_class.index_fieldname] = DATETIME(stored=field_class.stored, sortable=True)
            elif field_class.field_type == 'integer':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, numtype=int,
                                                                     field_boost=field_class.boost)
            elif field_class.field_type == 'float':
                schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, numtype=float,
                                                                     field_boost=field_class.boost)
            elif field_class.field_type == 'boolean':
                # Field boost isn't supported on BOOLEAN as of 1.8.2.
                schema_fields[field_class.index_fieldname] = BOOLEAN(stored=field_class.stored)
            elif field_class.field_type == 'ngram':
                schema_fields[field_class.index_fieldname] = NGRAM(minsize=3, maxsize=15, stored=field_class.stored,
                                                                   field_boost=field_class.boost)
            elif field_class.field_type == 'edge_ngram':
                schema_fields[field_class.index_fieldname] = NGRAMWORDS(minsize=2, maxsize=15, at='start',
                                                                        stored=field_class.stored,
                                                                        field_boost=field_class.boost)
            else:
                schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(),
                                                                  field_boost=field_class.boost, sortable=True)

            if field_class.document is True:
                content_field_name = field_class.index_fieldname
                schema_fields[field_class.index_fieldname].spelling = True

        # Fail more gracefully than relying on the backend to die if no fields
        # are found.
        if len(schema_fields) <= initial_key_count:
            raise SearchBackendError(
                "No fields were found in any search_indexes. Please correct this before attempting to search.")

        return (content_field_name, Schema(**schema_fields))


class ChineseWhooshEngine(WhooshEngine):
    backend = ChineseWhooshSearchBackend

The key change is replacing the original StemmingAnalyzer with ChineseAnalyzer (from jieba.analyse import ChineseAnalyzer) for text fields, which gives much better word segmentation for Chinese.
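To see why the analyzer matters, here is a toy comparison using only the standard library. It is not jieba's algorithm: forward_max_match and the tiny vocabulary are hypothetical stand-ins for a dictionary-based segmenter, and regex_tokens only approximates what a regex word tokenizer does with text that has no spaces.

```python
import re


def regex_tokens(text):
    # A regex word tokenizer sees an unspaced Chinese sentence as one "word".
    return re.findall(r"\w+", text)


def forward_max_match(text, vocabulary, max_len=4):
    # Toy dictionary-based segmenter (illustration only, not jieba).
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                pos += size
                break
    return tokens


sentence = "我们招聘工程师"
vocabulary = {"我们", "招聘", "工程师"}  # hypothetical mini dictionary

print(regex_tokens(sentence))
print(forward_max_match(sentence, vocabulary))
```

With a single giant token in the index, a query for 招聘 alone cannot match the sentence. Once the text is segmented into words, it can.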

4. Create search indexes for the models

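Each index file lives in the root directory of the app that owns the model, and is named search_indexes.py by default. Taking news and recruitment as examples: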

news/search_indexes.py

from haystack import indexes
from .models import News


class NewsIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(null=True, model_attr='title')
    summary = indexes.CharField(null=True, model_attr='summary')
    content = indexes.CharField(null=True, model_attr='content')
    publish_time = indexes.DateTimeField(model_attr='publish_time')
    lang = indexes.CharField(model_attr='language_code')

    def get_model(self):
        return News

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.language('all').all()

    def read_queryset(self, using=None):
        return self.get_model().objects.language()

recruit/search_indexes.py

from haystack import indexes
from .models import Recruit


class RecruitIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    job = indexes.CharField(null=True, model_attr='job')
    place = indexes.CharField(null=True, model_attr='place')
    content = indexes.CharField(null=True, model_attr='content')
    publish_time = indexes.DateTimeField(model_attr='publish_time')
    lang = indexes.CharField(model_attr='language_code')

    def get_model(self):
        return Recruit

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.language('all').all()

    def read_queryset(self, using=None):
        return self.get_model().objects.language()

In NewsIndex and RecruitIndex, the field lang = indexes.CharField(model_attr='language_code') and the .language() calls in index_queryset and read_queryset are written for compatibility with django-hvad. If you are not using django-hvad, delete lang = indexes.CharField(model_attr='language_code') and replace the .language() calls with .all().

Add a search directory under the templates directory, with the following tree:

├─templates
  └─search
      └─indexes
          ├─case
          ├─download
          ├─flatpage
          ├─goods
          ├─news
          ├─product
          ├─recruit
          ├─service
          ├─solution
          └─staff

Each folder under indexes is named after the lowercase app name. In the news folder, add a file named news_text.txt with the following content:

{{ object.title }}
{{ object.summary }}
{{ object.content }}
{{ object.publish_time }}

In the recruit folder, add a file named recruit_text.txt with the following content:

{{ object.job }}
{{ object.place }}
{{ object.content }}
{{ object.publish_time }}
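Both files above follow the pattern search/indexes/<app label>/<model name>_<field name>.txt, which Haystack uses by default to locate the template for a field declared with use_template=True. A small sketch of that rule (index_template_path is a hypothetical helper for illustration, not part of Haystack):

```python
def index_template_path(app_label, model_name, field_name):
    # Mirrors the default lookup pattern; names are lowercased to match
    # the lowercase app and model names used on disk.
    return "search/indexes/{}/{}_{}.txt".format(
        app_label.lower(), model_name.lower(), field_name
    )


print(index_template_path("news", "News", "text"))
print(index_template_path("recruit", "Recruit", "text"))
```

If a template is missing or misnamed, this is the path to check first. The field name here is text because that is the name of the document=True field in both index classes.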

Then run python manage.py rebuild_index to build the index. If the index has already been built, run python manage.py update_index to update it.

5. Multilingual search

urls.py

url(r'^search/', LangSearchView.as_view(), name='web_search'),

views.py

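from haystack.generic_views import SearchView

# Assumes forms.py (shown below) sits next to views.py.
from .forms import LangSearchForm


# Multilingual search.
# BaseMixin is a project-specific view mixin that is not shown in this post.
class LangSearchView(BaseMixin, SearchView):
    form_class = LangSearchForm
    paginate_by = 10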

forms.py

from django.utils.translation import get_language
from haystack.forms import HighlightedSearchForm


class LangSearchForm(HighlightedSearchForm):

    def search(self):
        sqs = super(LangSearchForm, self).search()

        sqs = sqs.filter(lang=get_language())
        return sqs

6. Highlighting search results

Add the page /templates/search/search.html:

{% extends 'web/base.html' %}
{% load i18n %}
{% load highlight %}

{% block seo %}
    {% trans "搜索结果" as default_seo_title %}
    {% include "web/seo.html" with default_seo_title=default_seo_title %}
{% endblock %}

{% block css %}
    <style>
        span.highlighted {
            color: #22a7c6;
        }
    </style>
{% endblock %}

{% block main %}
    <div id="news_search">
        <div class="container base news_search_container">
            <h2>{% trans '搜索' %}</h2>
            <form method="get" action="{% url 'web_search' %}" class="news_search_form">
                <input class="news_search_input" type="text" name="q"
                        {% if query %} value="{{ query }}" {% else %} value="" {% endif %}
                       placeholder="{% trans '搜索你想要的关键字' %}">
                <input type="submit" value="" class="back_submit">
                <input type="submit" value="" id="search"
                       style="background:url('/static/web/img/search_white.png') no-repeat center center;width: 60px;height: 40px;">
            </form>

            <div class="thumb-container">
                <span><img src="/static/web/img/location.png" alt=""></span>
                <a href="{% url 'web_search' %}">{% trans '关键词搜索' %}</a>
            </div>

            {% if query %}
                <div class="information-list wow fadeInUp ">
                    <h3>{% trans '搜索结果:' %}</h3>
                    {% for result in object_list %}
                        {% if result.model_name == 'news' %}
                            <a class="information-item" href="{{ result.object.get_absolute_url }}"
                               {% if result.object.url %}target="_blank"{% endif %}>
                                <div class="information-content">
                                    <div class="information_div">
                                        <p class="information-title">{% highlight result.object.title with query %}</p>
                                        <p class="information-date">{{ result.object.publish_time.date }}</p>
                                    </div>
                                    <div class="information-summary-div">
                                        <p class="information-summary">{% highlight result.object.summary with query %}</p>
                                    </div>
                                </div>
                            </a>
                        {% elif result.model_name == 'recruit' %}
                            <a class="information-item" href="{{ result.object.get_absolute_url }}"
                               {% if result.object.url %}target="_blank" {% endif %}>
                                <div class="information-content">
                                    <div class="information_div">
                                        <p class="information-title">{% highlight result.object.job with query %}</p>
                                        <p class="information-date">{{ result.object.update_time.date }}</p>
                                    </div>
                                    <div>
                                        <p class="information-summary">{% highlight result.object.place with query %}</p>
                                    </div>
                                </div>
                            </a>
                        {% endif %}
                    {% empty %}
                        <p>{% trans '没有找到您想要搜索的结果...' %}</p>
                    {% endfor %}
                </div>

                {% include 'web/pagination.html' %}
            {% else %}
                {# Show some example queries to run, maybe query syntax, something else? #}
            {% endif %}
        </div>
    </div>
{% endblock %} 
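The {% highlight %} tag used in the template above wraps each matched keyword in a span with the class highlighted, which is what the span.highlighted CSS rule targets. The following is a rough standard-library sketch of that idea only. It is not Haystack's implementation (the real highlighter also trims the text to a window around the match), and the highlight function is a hypothetical name.

```python
import re
from html import escape


def highlight(text, query, css_class="highlighted"):
    # Build one case-insensitive pattern from the whitespace-separated words.
    words = [re.escape(word) for word in query.split()]
    if not words:
        return escape(text)
    pattern = re.compile("|".join(words), re.IGNORECASE)

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then wrap the match itself.
        parts.append(escape(text[last:match.start()]))
        parts.append('<span class="{}">{}</span>'.format(
            css_class, escape(match.group(0))))
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


print(highlight("Django search", "search"))
```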

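The highlight tag combined with the custom span.highlighted CSS rule highlights the keywords in the search results, but the search form must inherit from HighlightedSearchForm. {{ result.model_name }} returns the model name of the current result, so different kinds of results can be rendered differently. {{ result.object }} is the original object, from which any of its attributes can be read. To search by model, use ModelSearchForm or HighlightedModelSearchForm. In that case the form that submits the search should no longer be hand-written. Render it with {{ form.as_table }} instead.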
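When ModelSearchForm is used, the model selection is sent as a repeated models query parameter whose values have the form app_label.model_name. A minimal sketch of the resulting URL, assuming the /search/ path from the urls.py above (build_search_url is a hypothetical helper for illustration):

```python
from urllib.parse import urlencode


def build_search_url(base_path, query, models=()):
    # One "q" parameter plus one "models" parameter per selected model.
    params = [("q", query)] + [("models", model) for model in models]
    return base_path + "?" + urlencode(params)


print(build_search_url("/search/", "django", ["news.news", "recruit.recruit"]))
```

This is the query string that the checkboxes rendered by {{ form.as_table }} produce on submit, so a hand-built link to a per-model search needs the same shape.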