|
{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"## Text analysis"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Useful links:\n",
|
|||
|
"- https://spacy.io/usage/linguistic-features\n",
|
|||
|
"- https://habr.com/ru/articles/738176/"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### Initializing the text analysis engine (module)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 1,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
"# The model must be installed first: python -m spacy download ru_core_news_lg\n",
"import spacy\n",
|
|||
|
"sp = spacy.load(\"ru_core_news_lg\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"### Basic text operations"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Sample text for the examples"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 2,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"text = \"Накануне Дня российской науки, 7 февраля, в Ульяновском государственном техническом университете открыли новую экспозицию «Они стояли у истоков ульяновской науки». Она посвящена выдающимся ученым XX – начала XXI веков, чьи достижения и изобретения сформировали научный и технологический облик Ульяновской области.\"\n",
|
|||
|
"\n",
|
|||
|
"text_ner = \"В рамках торжественного открытия экспозиции за активное участие в подготовке материалов были вручены благодарственные письма профессору кафедры летной эксплуатации и безопасности полетов Ульяновского института гражданской авиации имени Главного маршала авиации Б.П.Бугаева Сергею Косачевскому, начальнику управления научно-исследовательской и инновационной деятельности Ульяновского государственного педагогического университета имени И.Н.Ульянова Светлане Богатовой, руководителю пресс-службы Ульяновского государственного аграрного университета имени П.А.Столыпина Винере Насыровой, помощнику проректора по научной работе Ульяновского государственного университета Татьяне Лисовой.\""
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Preprocessing"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Changing case"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 3,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Накануне Дня российской науки, 7 февраля, в Ульяновском государственном техническом университете открыли новую экспозицию «Они стояли у истоков ульяновской науки». Она посвящена выдающимся ученым XX – начала XXI веков, чьи достижения и изобретения сформировали научный и технологический облик Ульяновской области.\n",
|
|||
|
"накануне дня российской науки, 7 февраля, в ульяновском государственном техническом университете открыли новую экспозицию «они стояли у истоков ульяновской науки». она посвящена выдающимся ученым xx – начала xxi веков, чьи достижения и изобретения сформировали научный и технологический облик ульяновской области.\n",
|
|||
|
"НАКАНУНЕ ДНЯ РОССИЙСКОЙ НАУКИ, 7 ФЕВРАЛЯ, В УЛЬЯНОВСКОМ ГОСУДАРСТВЕННОМ ТЕХНИЧЕСКОМ УНИВЕРСИТЕТЕ ОТКРЫЛИ НОВУЮ ЭКСПОЗИЦИЮ «ОНИ СТОЯЛИ У ИСТОКОВ УЛЬЯНОВСКОЙ НАУКИ». ОНА ПОСВЯЩЕНА ВЫДАЮЩИМСЯ УЧЕНЫМ XX – НАЧАЛА XXI ВЕКОВ, ЧЬИ ДОСТИЖЕНИЯ И ИЗОБРЕТЕНИЯ СФОРМИРОВАЛИ НАУЧНЫЙ И ТЕХНОЛОГИЧЕСКИЙ ОБЛИК УЛЬЯНОВСКОЙ ОБЛАСТИ.\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print(text)\n",
|
|||
|
"print(text.lower())\n",
|
|||
|
"print(text.upper())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Regular expressions"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Накануне Дня российской науки, 7 февраля, в Ульяновском государственном техническом университете открыли новую экспозицию «Они стояли у истоков ульяновской науки». Она посвящена выдающимся ученым XX – начала XXI веков, чьи достижения и изобретения сформировали научный и технологический облик Ульяновской области.\n",
|
|||
|
"Накануне Дня российской науки февраля в Ульяновском государственном техническом университете открыли новую экспозицию Они стояли у истоков ульяновской науки Она посвящена выдающимся ученым XX начала XXI веков чьи достижения и изобретения сформировали научный и технологический облик Ульяновской области\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import re\n",
|
|||
|
"\n",
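"# Keep letters and spaces only; ё/Ё are listed explicitly because they fall outside the а-я/А-Я ranges.\n",
"regex = re.compile(\"[^a-zA-Zа-яА-ЯёЁ ]\")\n",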
|
|||
|
"print(text)\n",
|
|||
|
"print(\" \".join(regex.sub(\"\", text).split()))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Emoji"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Пример эмодзи 👍\n",
|
|||
|
"Пример эмодзи :thumbs_up:\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import emoji\n",
|
|||
|
"\n",
|
|||
|
"print(emoji.emojize(\"Пример эмодзи :thumbs_up:\"))\n",
|
|||
|
"print(emoji.demojize(\"Пример эмодзи 👍\"))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Diacritics"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"stavanger Накануне Дня россиискои науки, 7 февраля, в Ульяновском государственном техническом университете открыли новую экспозицию «Они стояли у истоков ульяновскои науки». Она посвящена выдающимся ученым XX – начала XXI веков, чьи достижения и изобретения сформировали научныи и технологическии облик Ульяновскои области.\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import unicodedata\n",
|
|||
|
"\n",
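"# NFKD decomposes each accented character into a base letter plus combining marks.\n",
"# Caveat: Cyrillic й also decomposes (и + combining breve), so it comes out as и.\n",
"norm_text = unicodedata.normalize(\"NFKD\", f\"stävänger {text}\")\n",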
|
|||
|
"\n",
|
|||
|
"res = \"\".join([char for char in norm_text if not unicodedata.combining(char)])\n",
|
|||
|
"\n",
|
|||
|
"print(res)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Converting numbers to words"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['Накануне', 'Дня', 'российской', 'науки', ',', 'семь', 'февраля', ',', 'в', 'Ульяновском', 'государственном', 'техническом', 'университете', 'открыли', 'новую', 'экспозицию', '«', 'Они', 'стояли', 'у', 'истоков', 'ульяновской', 'науки', '»', '.', 'Она', 'посвящена', 'выдающимся', 'ученым', 'двадцать', '–', 'начала', 'двадцать один', 'веков', ',', 'чьи', 'достижения', 'и', 'изобретения', 'сформировали', 'научный', 'и', 'технологический', 'облик', 'Ульяновской', 'области', '.']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
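"# Hand-written lookup table: numerals found in the sample text and their Russian spellings\n",
"times = {\"7\": \"семь\", \"XX\": \"двадцать\", \"XXI\": \"двадцать один\"}\n",
"\n",
"# Replace every token that has an entry in the table; leave the rest unchanged\n",
"print([times.get(token.text, token.text) for token in sp(text)])"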
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Tokenization"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Накануне Дня российской науки, 7 февраля, в Ульяновском государственном техническом университете открыли новую экспозицию «Они стояли у истоков ульяновской науки». Она посвящена выдающимся ученым XX – начала XXI веков, чьи достижения и изобретения сформировали научный и технологический облик Ульяновской области.\n",
|
|||
|
",\n",
|
|||
|
"False\n",
|
|||
|
"False\n",
|
|||
|
"True\n",
|
|||
|
"\n",
|
|||
|
"Накануне Дня российской науки, 7 февраля, в Ульяновском государственном техническом университете открыли новую экспозицию «Они стояли у истоков ульяновской науки».\n",
|
|||
|
"Она посвящена выдающимся ученым XX – начала XXI веков, чьи достижения и изобретения сформировали научный и технологический облик Ульяновской области.\n",
|
|||
|
"\n",
|
|||
|
"Накануне\n",
|
|||
|
"Дня\n",
|
|||
|
"российской\n",
|
|||
|
"науки\n",
|
|||
|
",\n",
|
|||
|
"7\n",
|
|||
|
"февраля\n",
|
|||
|
",\n",
|
|||
|
"в\n",
|
|||
|
"Ульяновском\n",
|
|||
|
"государственном\n",
|
|||
|
"техническом\n",
|
|||
|
"университете\n",
|
|||
|
"открыли\n",
|
|||
|
"новую\n",
|
|||
|
"экспозицию\n",
|
|||
|
"«\n",
|
|||
|
"Они\n",
|
|||
|
"стояли\n",
|
|||
|
"у\n",
|
|||
|
"истоков\n",
|
|||
|
"ульяновской\n",
|
|||
|
"науки\n",
|
|||
|
"»\n",
|
|||
|
".\n",
|
|||
|
"Она\n",
|
|||
|
"посвящена\n",
|
|||
|
"выдающимся\n",
|
|||
|
"ученым\n",
|
|||
|
"XX\n",
|
|||
|
"–\n",
|
|||
|
"начала\n",
|
|||
|
"XXI\n",
|
|||
|
"веков\n",
|
|||
|
",\n",
|
|||
|
"чьи\n",
|
|||
|
"достижения\n",
|
|||
|
"и\n",
|
|||
|
"изобретения\n",
|
|||
|
"сформировали\n",
|
|||
|
"научный\n",
|
|||
|
"и\n",
|
|||
|
"технологический\n",
|
|||
|
"облик\n",
|
|||
|
"Ульяновской\n",
|
|||
|
"области\n",
|
|||
|
".\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"doc = sp(text)\n",
|
|||
|
"\n",
|
|||
|
"print(doc)\n",
|
|||
|
"print(doc[4])\n",
|
|||
|
"print(doc[4].is_sent_start)\n",
|
|||
|
"print(doc[4].is_sent_end)\n",
|
|||
|
"print(doc[4].is_punct)\n",
|
|||
|
"\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"for sentence in doc.sents:\n",
|
|||
|
" print(sentence)\n",
|
|||
|
"\n",
|
|||
|
"print()\n",
|
|||
|
"\n",
|
|||
|
"for sentence in doc.sents:\n",
|
|||
|
" for word in sentence:\n",
|
|||
|
" print(word)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Stop-word removal"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"List of stop words"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['а', 'авось', 'ага', 'агу', 'аж', 'ай', 'али', 'алло', 'ау', 'ах', 'ая', 'б', 'бац', 'без', 'безусловно', 'бишь', 'благо', 'благодаря', 'ближайшие', 'близко', 'более', 'больше', 'будем', 'будет', 'будете', 'будешь', 'будто', 'буду', 'будут', 'будучи', 'будь', 'будьте', 'бы', 'бывает', 'бывала', 'бывали', 'бываю', 'бывают', 'был', 'была', 'были', 'было', 'бытует', 'быть', 'в', 'вам', 'вами', 'вас', 'ваш', 'ваша', 'ваше', 'ваши', 'вдали', 'вдобавок', 'вдруг', 'ведь', 'везде', 'вернее', 'весь', 'взаимно', 'взаправду', 'видно', 'вишь', 'включая', 'вместо', 'внакладе', 'вначале', 'вне', 'вниз', 'внизу', 'вновь', 'во', 'вовсе', 'возможно', 'воистину', 'вокруг', 'вон', 'вообще', 'вопреки', 'вот', 'вперекор', 'вплоть', 'вполне', 'вправду', 'вправе', 'впрочем', 'впрямь', 'вресноту', 'вроде', 'вряд', 'все', 'всегда', 'всего', 'всей', 'всем', 'всеми', 'всему', 'всех', 'всею', 'всея', 'всю', 'всюду', 'вся', 'всякий', 'всякого', 'всякой', 'всячески', 'всё', 'всём', 'вчеред', 'вы', 'г', 'гав', 'где', 'го', 'гораздо', 'д', 'да', 'дабы', 'давайте', 'давно', 'давным', 'даже', 'далее', 'далеко', 'дальше', 'данная', 'данного', 'данное', 'данной', 'данном', 'данному', 'данные', 'данный', 'данных', 'дану', 'данунах', 'даром', 'де', 'действительно', 'для', 'до', 'довольно', 'доколе', 'доколь', 'долго', 'должен', 'должна', 'должно', 'должны', 'должный', 'дополнительно', 'другая', 'другие', 'другим', 'другими', 'других', 'другое', 'другой', 'е', 'его', 'едва', 'едим', 'едят', 'ее', 'ежели', 'ей', 'ел', 'ела', 'еле', 'ем', 'ему', 'емъ', 'если', 'ест', 'есть', 'ешь', 'еще', 'ещё', 'ею', 'её', 'ж', 'же', 'з', 'за', 'затем', 'зато', 'зачем', 'здесь', 'значит', 'зря', 'и', 'ибо', 'из', 'или', 'иль', 'им', 'имеет', 'имел', 'имела', 'имело', 'именно', 'иметь', 'ими', 'имъ', 'иначе', 'иногда', 'иным', 'иными', 'итак', 'их', 'ишь', 'й', 'к', 'ка', 'кабы', 'каждая', 'каждое', 'каждые', 'каждый', 'кажется', 'казалась', 'казались', 'казалось', 'казался', 'казаться', 'как', 'какая', 'какие', 
'каким', 'какими', 'каков', 'какого', 'какой', 'какому', 'какою', 'касательно', 'кем', 'ко', 'когда', 'кого', 'кой', 'коли', 'коль', 'ком', 'кому', 'комья', 'конечно', 'короче', 'которая', 'которого', 'которое', 'которой', 'котором', 'которому', 'которою', 'которую', 'которые', 'который', 'которым', 'которыми', 'которых', 'кроме', 'кстати', 'кто', 'ку', 'куда', 'л', 'ли', 'либо', 'лишь', 'любая', 'любого', 'любое', 'любой', 'любом', 'любую', 'любыми', 'любых', 'м', 'мало', 'меж', 'между', 'менее', 'меньше', 'меня', 'мимо', 'мне', 'многие', 'много', 'многого', 'многое', 'многом', 'многому', 'мной', 'мною', 'мог', 'моги', 'могите', 'могла', 'могли',
|
|||
|
"['!', '\"', '#', '$', '%', '&', \"'\", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\\\', ']', '^', '_', '`', '{', '|', '}', '~']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from string import punctuation\n",
|
|||
|
"\n",
|
|||
|
"print(sorted(sp.Defaults.stop_words))\n",
|
|||
|
"print(list(punctuation))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Stop words, punctuation, and digits found in the text"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['Накануне', ',', '7', ',', 'в', '«', 'Они', 'у', '»', '.', 'Она', '–', 'начала', ',', 'и', 'и', '.']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"stops = [token.text for token in doc if token.is_stop or token.is_punct or token.is_digit]\n",
|
|||
|
"print(stops)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Text without stop words, punctuation, and digits"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"[Дня, российской, науки, февраля, Ульяновском, государственном, техническом, университете, открыли, новую, экспозицию, стояли, истоков, ульяновской, науки, посвящена, выдающимся, ученым, XX, XXI, веков, чьи, достижения, изобретения, сформировали, научный, технологический, облик, Ульяновской, области]\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"without_stops = [token for token in doc if not token.is_stop and not token.is_punct and not token.is_digit]\n",
|
|||
|
"print(without_stops)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Stemming"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Porter stemmer (NLTK's implementation targets English, so the Russian tokens in the output are only lowercased)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['дня', 'российской', 'науки', 'февраля', 'ульяновском', 'государственном', 'техническом', 'университете', 'открыли', 'новую', 'экспозицию', 'стояли', 'истоков', 'ульяновской', 'науки', 'посвящена', 'выдающимся', 'ученым', 'xx', 'xxi', 'веков', 'чьи', 'достижения', 'изобретения', 'сформировали', 'научный', 'технологический', 'облик', 'ульяновской', 'области']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from nltk.stem.porter import PorterStemmer\n",
|
|||
|
"\n",
|
|||
|
"porter = PorterStemmer()\n",
|
|||
|
"\n",
|
|||
|
"print(list(map(porter.stem, map(str, without_stops))))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"Snowball stemmer"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 13,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['дня', 'российск', 'наук', 'феврал', 'ульяновск', 'государствен', 'техническ', 'университет', 'откр', 'нов', 'экспозиц', 'стоя', 'исток', 'ульяновск', 'наук', 'посвящ', 'выда', 'учен', 'XX', 'XXI', 'век', 'чьи', 'достижен', 'изобретен', 'сформирова', 'научн', 'технологическ', 'облик', 'ульяновск', 'област']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from nltk.stem.snowball import SnowballStemmer\n",
|
|||
|
"\n",
|
|||
|
"snowball = SnowballStemmer(language=\"russian\")\n",
|
|||
|
"\n",
|
|||
|
"print(list(map(snowball.stem, map(str, without_stops))))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Lemmatization"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['день', 'российский', 'наука', 'февраль', 'ульяновский', 'государственный', 'технический', 'университет', 'открыть', 'новый', 'экспозиция', 'стоять', 'исток', 'ульяновский', 'наука', 'посвятить', 'выдающийся', 'учёный', 'xx', 'xxi', 'век', 'чей', 'достижение', 'изобретение', 'сформировать', 'научный', 'технологический', 'облик', 'ульяновский', 'область']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print([token.lemma_ for token in without_stops])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Morphological analysis (POS tagging)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['Дня PROPN', 'российской ADJ', 'науки NOUN', 'февраля NOUN', 'Ульяновском ADJ', 'государственном ADJ', 'техническом ADJ', 'университете NOUN', 'открыли VERB', 'новую ADJ', 'экспозицию NOUN', 'стояли VERB', 'истоков NOUN', 'ульяновской ADJ', 'науки NOUN', 'посвящена VERB', 'выдающимся ADJ', 'ученым NOUN', 'XX ADJ', 'XXI ADJ', 'веков NOUN', 'чьи DET', 'достижения NOUN', 'изобретения NOUN', 'сформировали VERB', 'научный ADJ', 'технологический ADJ', 'облик NOUN', 'Ульяновской ADJ', 'области NOUN']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print([f\"{token.text} {token.pos_}\" for token in without_stops])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['Дня [Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing]', 'российской [Case=Gen|Degree=Pos|Gender=Fem|Number=Sing]', 'науки [Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing]', 'февраля [Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing]', 'Ульяновском [Case=Loc|Degree=Pos|Gender=Masc|Number=Sing]', 'государственном [Case=Loc|Degree=Pos|Gender=Masc|Number=Sing]', 'техническом [Case=Loc|Degree=Pos|Gender=Masc|Number=Sing]', 'университете [Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing]', 'открыли [Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act]', 'новую [Case=Acc|Degree=Pos|Gender=Fem|Number=Sing]', 'экспозицию [Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing]', 'стояли [Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act]', 'истоков [Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur]', 'ульяновской [Case=Gen|Degree=Pos|Gender=Fem|Number=Sing]', 'науки [Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing]', 'посвящена [Aspect=Perf|Gender=Fem|Number=Sing|StyleVariant=Short|Tense=Past|VerbForm=Part|Voice=Pass]', 'выдающимся [Case=Dat|Degree=Pos|Number=Plur]', 'ученым [Animacy=Anim|Case=Dat|Gender=Masc|Number=Plur]', 'XX []', 'XXI []', 'веков [Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur]', 'чьи [Case=Nom|Number=Plur]', 'достижения [Animacy=Inan|Case=Nom|Gender=Neut|Number=Plur]', 'изобретения [Animacy=Inan|Case=Nom|Gender=Neut|Number=Plur]', 'сформировали [Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act]', 'научный [Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing]', 'технологический [Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing]', 'облик [Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing]', 'Ульяновской [Case=Gen|Degree=Pos|Gender=Fem|Number=Sing]', 'области [Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing]']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print([f\"{token.text} [{token.morph}]\" for token in without_stops])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Syntactic analysis (dependency parsing)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Накануне Дня российской науки, 7 февраля, в Ульяновском государственном техническом университете открыли новую экспозицию «Они стояли у истоков ульяновской науки». Она посвящена выдающимся ученым XX – начала XXI веков, чьи достижения и изобретения сформировали научный и технологический облик Ульяновской области.\n",
|
|||
|
"Накануне | case | Дня | PROPN []\n",
|
|||
|
"Дня | obl | открыли | VERB ['Накануне', 'науки']\n",
|
|||
|
"российской | amod | науки | NOUN []\n",
|
|||
|
"науки | nmod | Дня | PROPN ['российской']\n",
|
|||
|
", | punct | 7 | ADJ []\n",
|
|||
|
"7 | obl | открыли | VERB [',', 'февраля', ',']\n",
|
|||
|
"февраля | flat | 7 | ADJ []\n",
|
|||
|
", | punct | 7 | ADJ []\n",
|
|||
|
"в | case | университете | NOUN []\n",
|
|||
|
"Ульяновском | amod | университете | NOUN []\n",
|
|||
|
"государственном | amod | университете | NOUN []\n",
|
|||
|
"техническом | amod | университете | NOUN []\n",
|
|||
|
"университете | obl | открыли | VERB ['в', 'Ульяновском', 'государственном', 'техническом']\n",
|
|||
|
"открыли | ROOT | открыли | VERB ['Дня', '7', 'университете', 'экспозицию', 'стояли']\n",
|
|||
|
"новую | amod | экспозицию | NOUN []\n",
|
|||
|
"экспозицию | obj | открыли | VERB ['новую']\n",
|
|||
|
"« | punct | стояли | VERB []\n",
|
|||
|
"Они | nsubj | стояли | VERB []\n",
|
|||
|
"стояли | parataxis | открыли | VERB ['«', 'Они', 'истоков', '»']\n",
|
|||
|
"у | case | истоков | NOUN []\n",
|
|||
|
"истоков | obl | стояли | VERB ['у', 'науки']\n",
|
|||
|
"ульяновской | amod | науки | NOUN []\n",
|
|||
|
"науки | nmod | истоков | NOUN ['ульяновской']\n",
|
|||
|
"» | punct | стояли | VERB []\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print(text)\n",
|
|||
|
"print(\n",
|
|||
|
" \"\\n\".join(\n",
|
|||
|
" [\n",
|
|||
|
" f\"{token.text} | {token.dep_} | {token.head.text} | {token.head.pos_} {[child.text for child in token.children]}\"\n",
|
|||
|
" for token in sp(text.split(\".\")[0])\n",
|
|||
|
" ]\n",
|
|||
|
" )\n",
|
|||
|
")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 18,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"ru\" id=\"d37f85f46121492297d2fe4e124bade0-0\" class=\"displacy\" width=\"3550\" height=\"662.0\" direction=\"ltr\" style=\"max-width: none; height: 662.0px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Накануне</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">ADP</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"225\">Дня</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"225\">PROPN</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"400\">российской</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"400\">ADJ</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"575\">науки,</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"575\">NOUN</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"750\">7</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"750\">ADJ</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"925\">февраля,</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"925\">NOUN</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">в</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">ADP</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1275\">Ульяновском</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1275\">ADJ</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1450\">государственном</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1450\">ADJ</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1625\">техническом</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1625\">ADJ</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1800\">университете</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1800\">NOUN</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1975\">открыли</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1975\">VERB</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2150\">новую</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2150\">ADJ</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2325\">экспозицию «</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2325\">NOUN</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2500\">Они</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2500\">PRON</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2675\">стояли</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2675\">VERB</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2850\">у</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2850\">ADP</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3025\">истоков</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3025\">NOUN</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3200\">ульяновской</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3200\">ADJ</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
|
|||
|
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3375\">науки»</tspan>\n",
|
|||
|
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3375\">NOUN</tspan>\n",
|
|||
|
"</text>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-0\" stroke-width=\"2px\" d=\"M70,527.0 C70,439.5 200.0,439.5 200.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">case</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M70,529.0 L62,517.0 78,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-1\" stroke-width=\"2px\" d=\"M245,527.0 C245,2.0 1975.0,2.0 1975.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">obl</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M245,529.0 L237,517.0 253,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-2\" stroke-width=\"2px\" d=\"M420,527.0 C420,439.5 550.0,439.5 550.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M420,529.0 L412,517.0 428,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-3\" stroke-width=\"2px\" d=\"M245,527.0 C245,352.0 555.0,352.0 555.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nmod</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M555.0,529.0 L563.0,517.0 547.0,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-4\" stroke-width=\"2px\" d=\"M770,527.0 C770,89.5 1970.0,89.5 1970.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">obl</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M770,529.0 L762,517.0 778,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-5\" stroke-width=\"2px\" d=\"M770,527.0 C770,439.5 900.0,439.5 900.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">flat</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M900.0,529.0 L908.0,517.0 892.0,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-6\" stroke-width=\"2px\" d=\"M1120,527.0 C1120,177.0 1790.0,177.0 1790.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">case</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M1120,529.0 L1112,517.0 1128,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-7\" stroke-width=\"2px\" d=\"M1295,527.0 C1295,264.5 1785.0,264.5 1785.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-7\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M1295,529.0 L1287,517.0 1303,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-8\" stroke-width=\"2px\" d=\"M1470,527.0 C1470,352.0 1780.0,352.0 1780.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-8\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M1470,529.0 L1462,517.0 1478,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-9\" stroke-width=\"2px\" d=\"M1645,527.0 C1645,439.5 1775.0,439.5 1775.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-9\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M1645,529.0 L1637,517.0 1653,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-10\" stroke-width=\"2px\" d=\"M1820,527.0 C1820,439.5 1950.0,439.5 1950.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-10\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">obl</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M1820,529.0 L1812,517.0 1828,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-11\" stroke-width=\"2px\" d=\"M2170,527.0 C2170,439.5 2300.0,439.5 2300.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-11\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M2170,529.0 L2162,517.0 2178,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-12\" stroke-width=\"2px\" d=\"M1995,527.0 C1995,352.0 2305.0,352.0 2305.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-12\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">obj</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M2305.0,529.0 L2313.0,517.0 2297.0,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-13\" stroke-width=\"2px\" d=\"M2520,527.0 C2520,439.5 2650.0,439.5 2650.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-13\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M2520,529.0 L2512,517.0 2528,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-14\" stroke-width=\"2px\" d=\"M1995,527.0 C1995,264.5 2660.0,264.5 2660.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-14\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">parataxis</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M2660.0,529.0 L2668.0,517.0 2652.0,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-15\" stroke-width=\"2px\" d=\"M2870,527.0 C2870,439.5 3000.0,439.5 3000.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-15\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">case</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M2870,529.0 L2862,517.0 2878,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-16\" stroke-width=\"2px\" d=\"M2695,527.0 C2695,352.0 3005.0,352.0 3005.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-16\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">obl</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M3005.0,529.0 L3013.0,517.0 2997.0,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-17\" stroke-width=\"2px\" d=\"M3220,527.0 C3220,439.5 3350.0,439.5 3350.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-17\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M3220,529.0 L3212,517.0 3228,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"\n",
|
|||
|
"<g class=\"displacy-arrow\">\n",
|
|||
|
" <path class=\"displacy-arc\" id=\"arrow-d37f85f46121492297d2fe4e124bade0-0-18\" stroke-width=\"2px\" d=\"M3045,527.0 C3045,352.0 3355.0,352.0 3355.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
|||
|
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
|||
|
" <textPath xlink:href=\"#arrow-d37f85f46121492297d2fe4e124bade0-0-18\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nmod</textPath>\n",
|
|||
|
" </text>\n",
|
|||
|
" <path class=\"displacy-arrowhead\" d=\"M3355.0,529.0 L3363.0,517.0 3347.0,517.0\" fill=\"currentColor\"/>\n",
|
|||
|
"</g>\n",
|
|||
|
"</svg></span>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<IPython.core.display.HTML object>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from spacy import displacy\n",
|
|||
|
"\n",
|
|||
|
"displacy.render(sp(text.split(\".\")[0]), style=\"dep\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### Обнаружение именованных сущностей"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['ульяновский государственный технический университет ORG', 'ульяновский область LOC']\n",
|
|||
|
"['ульяновский институт гражданский авиация имя главный маршал авиация ORG', 'б.п.бугаева PER', 'сергей косачевскому PER', 'ульяновский государственный педагогический университет имя и.н.ульянова ORG', 'светлане богатов PER', 'ульяновский государственный аграрный университет имя п.а.столыпина ORG', 'винере насыровой PER', 'ульяновский государственный университет ORG', 'татьяна лисовой PER']\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print([f\"{entity.lemma_} {entity.label_}\" for entity in sp(text).ents])\n",
|
|||
|
"print([f\"{entity.lemma_} {entity.label_}\" for entity in sp(text_ner).ents])"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">В рамках торжественного открытия экспозиции за активное участие в подготовке материалов были вручены благодарственные письма профессору кафедры летной эксплуатации и безопасности полетов \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Ульяновского института гражданской авиации имени Главного маршала авиации\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
" \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Б.П.Бугаева\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PER</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
" \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Сергею Косачевскому\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PER</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
", начальнику управления научно-исследовательской и инновационной деятельности \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Ульяновского государственного педагогического университета имени И.Н.Ульянова\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
" \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Светлане Богатовой\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PER</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
", руководителю пресс-службы \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Ульяновского государственного аграрного университета имени П.А.Столыпина\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
" \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Винере Насыровой\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PER</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
", помощнику проректора по научной работе \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Ульяновского государственного университета\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
" \n",
|
|||
|
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
|||
|
" Татьяне Лисовой\n",
|
|||
|
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PER</span>\n",
|
|||
|
"</mark>\n",
|
|||
|
".</div></span>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
"<IPython.core.display.HTML object>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"displacy.render(sp(text_ner), style=\"ent\")"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"### Векторизация"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### Мешок слов (BoW, Bag of Words)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>xx</th>\n",
|
|||
|
" <th>xxi</th>\n",
|
|||
|
" <th>авиации</th>\n",
|
|||
|
" <th>аграрного</th>\n",
|
|||
|
" <th>активное</th>\n",
|
|||
|
" <th>безопасности</th>\n",
|
|||
|
" <th>благодарственные</th>\n",
|
|||
|
" <th>богатовой</th>\n",
|
|||
|
" <th>бугаева</th>\n",
|
|||
|
" <th>были</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>университета</th>\n",
|
|||
|
" <th>университете</th>\n",
|
|||
|
" <th>управления</th>\n",
|
|||
|
" <th>участие</th>\n",
|
|||
|
" <th>ученым</th>\n",
|
|||
|
" <th>февраля</th>\n",
|
|||
|
" <th>чьи</th>\n",
|
|||
|
" <th>эксплуатации</th>\n",
|
|||
|
" <th>экспозиции</th>\n",
|
|||
|
" <th>экспозицию</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text_ner</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>2 rows × 87 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" xx xxi авиации аграрного активное безопасности \\\n",
|
|||
|
"text 1 1 0 0 0 0 \n",
|
|||
|
"text_ner 0 0 2 1 1 1 \n",
|
|||
|
"\n",
|
|||
|
" благодарственные богатовой бугаева были ... университета \\\n",
|
|||
|
"text 0 0 0 0 ... 0 \n",
|
|||
|
"text_ner 1 1 1 1 ... 3 \n",
|
|||
|
"\n",
|
|||
|
" университете управления участие ученым февраля чьи \\\n",
|
|||
|
"text 1 0 0 1 1 1 \n",
|
|||
|
"text_ner 0 1 1 0 0 0 \n",
|
|||
|
"\n",
|
|||
|
" эксплуатации экспозиции экспозицию \n",
|
|||
|
"text 0 0 1 \n",
|
|||
|
"text_ner 1 1 0 \n",
|
|||
|
"\n",
|
|||
|
"[2 rows x 87 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from scipy import sparse\n",
|
|||
|
"from sklearn.feature_extraction.text import CountVectorizer\n",
|
|||
|
"\n",
|
|||
|
"counts_vectorizer = CountVectorizer()\n",
|
|||
|
"counts_matrix = sparse.csr_matrix(counts_vectorizer.fit_transform([text, text_ner]))\n",
|
|||
|
"counts_df = pd.DataFrame(\n",
|
|||
|
" counts_matrix.toarray(),\n",
|
|||
|
" index=[\"text\", \"text_ner\"],\n",
|
|||
|
" columns=counts_vectorizer.get_feature_names_out(),\n",
|
|||
|
")\n",
|
|||
|
"counts_df"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Пропуск термов, которые содержатся не менее чем в двух документах"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 22,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>xx</th>\n",
|
|||
|
" <th>xxi</th>\n",
|
|||
|
" <th>авиации</th>\n",
|
|||
|
" <th>аграрного</th>\n",
|
|||
|
" <th>активное</th>\n",
|
|||
|
" <th>безопасности</th>\n",
|
|||
|
" <th>благодарственные</th>\n",
|
|||
|
" <th>богатовой</th>\n",
|
|||
|
" <th>бугаева</th>\n",
|
|||
|
" <th>были</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>университета</th>\n",
|
|||
|
" <th>университете</th>\n",
|
|||
|
" <th>управления</th>\n",
|
|||
|
" <th>участие</th>\n",
|
|||
|
" <th>ученым</th>\n",
|
|||
|
" <th>февраля</th>\n",
|
|||
|
" <th>чьи</th>\n",
|
|||
|
" <th>эксплуатации</th>\n",
|
|||
|
" <th>экспозиции</th>\n",
|
|||
|
" <th>экспозицию</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text_ner</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text1</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text_ner1</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text2</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text_ner2</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>6 rows × 87 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" xx xxi авиации аграрного активное безопасности \\\n",
|
|||
|
"text 1 1 0 0 0 0 \n",
|
|||
|
"text_ner 0 0 2 1 1 1 \n",
|
|||
|
"text1 1 1 0 0 0 0 \n",
|
|||
|
"text_ner1 0 0 2 1 1 1 \n",
|
|||
|
"text2 1 1 0 0 0 0 \n",
|
|||
|
"text_ner2 0 0 2 1 1 1 \n",
|
|||
|
"\n",
|
|||
|
" благодарственные богатовой бугаева были ... университета \\\n",
|
|||
|
"text 0 0 0 0 ... 0 \n",
|
|||
|
"text_ner 1 1 1 1 ... 3 \n",
|
|||
|
"text1 0 0 0 0 ... 0 \n",
|
|||
|
"text_ner1 1 1 1 1 ... 3 \n",
|
|||
|
"text2 0 0 0 0 ... 0 \n",
|
|||
|
"text_ner2 1 1 1 1 ... 3 \n",
|
|||
|
"\n",
|
|||
|
" университете управления участие ученым февраля чьи \\\n",
|
|||
|
"text 1 0 0 1 1 1 \n",
|
|||
|
"text_ner 0 1 1 0 0 0 \n",
|
|||
|
"text1 1 0 0 1 1 1 \n",
|
|||
|
"text_ner1 0 1 1 0 0 0 \n",
|
|||
|
"text2 1 0 0 1 1 1 \n",
|
|||
|
"text_ner2 0 1 1 0 0 0 \n",
|
|||
|
"\n",
|
|||
|
" эксплуатации экспозиции экспозицию \n",
|
|||
|
"text 0 0 1 \n",
|
|||
|
"text_ner 1 1 0 \n",
|
|||
|
"text1 0 0 1 \n",
|
|||
|
"text_ner1 1 1 0 \n",
|
|||
|
"text2 0 0 1 \n",
|
|||
|
"text_ner2 1 1 0 \n",
|
|||
|
"\n",
|
|||
|
"[6 rows x 87 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 22,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"counts_min_vectorizer = CountVectorizer(min_df=2)\n",
|
|||
|
"counts_min_matrix = sparse.csr_matrix(\n",
|
|||
|
" counts_min_vectorizer.fit_transform(\n",
|
|||
|
" [text, text_ner, text, text_ner, text, text_ner]\n",
|
|||
|
" )\n",
|
|||
|
")\n",
|
|||
|
"counts_min_df = pd.DataFrame(\n",
|
|||
|
" counts_min_matrix.toarray(),\n",
|
|||
|
" index=[\"text\", \"text_ner\", \"text1\", \"text_ner1\", \"text2\", \"text_ner2\"],\n",
|
|||
|
" columns=counts_min_vectorizer.get_feature_names_out(),\n",
|
|||
|
")\n",
|
|||
|
"counts_min_df"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Пропуск термов, которые содержатся более чем в двух документах"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"ename": "ValueError",
|
|||
|
"evalue": "After pruning, no terms remain. Try a lower min_df or a higher max_df.",
|
|||
|
"output_type": "error",
|
|||
|
"traceback": [
|
|||
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|||
|
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
|
|||
|
"Cell \u001b[0;32mIn[23], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m counts_max_vectorizer \u001b[38;5;241m=\u001b[39m CountVectorizer(max_df\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m2\u001b[39m)\n\u001b[1;32m 2\u001b[0m counts_max_matrix \u001b[38;5;241m=\u001b[39m sparse\u001b[38;5;241m.\u001b[39mcsr_matrix(\n\u001b[0;32m----> 3\u001b[0m \u001b[43mcounts_max_vectorizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfit_transform\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[43mtext\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtext_ner\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtext\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtext_ner\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtext\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtext_ner\u001b[49m\u001b[43m]\u001b[49m\n\u001b[1;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 6\u001b[0m )\n\u001b[1;32m 7\u001b[0m counts_max_df \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mDataFrame(\n\u001b[1;32m 8\u001b[0m counts_max_matrix\u001b[38;5;241m.\u001b[39mtoarray(),\n\u001b[1;32m 9\u001b[0m index\u001b[38;5;241m=\u001b[39m[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext_ner\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext1\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext_ner1\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext2\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext_ner2\u001b[39m\u001b[38;5;124m\"\u001b[39m],\n\u001b[1;32m 10\u001b[0m columns\u001b[38;5;241m=\u001b[39mcounts_max_vectorizer\u001b[38;5;241m.\u001b[39mget_feature_names_out(),\n\u001b[1;32m 11\u001b[0m )\n\u001b[1;32m 12\u001b[0m 
counts_max_df\n",
|
|||
|
"File \u001b[0;32m~/Projects/python/ckmai/.venv/lib/python3.12/site-packages/sklearn/base.py:1473\u001b[0m, in \u001b[0;36m_fit_context.<locals>.decorator.<locals>.wrapper\u001b[0;34m(estimator, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1466\u001b[0m estimator\u001b[38;5;241m.\u001b[39m_validate_params()\n\u001b[1;32m 1468\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[1;32m 1469\u001b[0m skip_parameter_validation\u001b[38;5;241m=\u001b[39m(\n\u001b[1;32m 1470\u001b[0m prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[1;32m 1471\u001b[0m )\n\u001b[1;32m 1472\u001b[0m ):\n\u001b[0;32m-> 1473\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
|||
|
"File \u001b[0;32m~/Projects/python/ckmai/.venv/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1385\u001b[0m, in \u001b[0;36mCountVectorizer.fit_transform\u001b[0;34m(self, raw_documents, y)\u001b[0m\n\u001b[1;32m 1383\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m max_features \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 1384\u001b[0m X \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_sort_features(X, vocabulary)\n\u001b[0;32m-> 1385\u001b[0m X \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_limit_features\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1386\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mvocabulary\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmax_doc_count\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmin_doc_count\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmax_features\u001b[49m\n\u001b[1;32m 1387\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1388\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m max_features \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 1389\u001b[0m X \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_sort_features(X, vocabulary)\n",
|
|||
|
"File \u001b[0;32m~/Projects/python/ckmai/.venv/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:1237\u001b[0m, in \u001b[0;36mCountVectorizer._limit_features\u001b[0;34m(self, X, vocabulary, high, low, limit)\u001b[0m\n\u001b[1;32m 1235\u001b[0m kept_indices \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39mwhere(mask)[\u001b[38;5;241m0\u001b[39m]\n\u001b[1;32m 1236\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(kept_indices) \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m0\u001b[39m:\n\u001b[0;32m-> 1237\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 1238\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAfter pruning, no terms remain. Try a lower min_df or a higher max_df.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 1239\u001b[0m )\n\u001b[1;32m 1240\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m X[:, kept_indices]\n",
|
|||
|
"\u001b[0;31mValueError\u001b[0m: After pruning, no terms remain. Try a lower min_df or a higher max_df."
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"counts_max_vectorizer = CountVectorizer(max_df=2)\n",
|
|||
|
"counts_max_matrix = sparse.csr_matrix(\n",
"    counts_max_vectorizer.fit_transform(\n",
"        [text, text_ner, text, text_ner, text, text_ner]\n",
"    )\n",
")\n",
"counts_max_df = pd.DataFrame(\n",
"    counts_max_matrix.toarray(),\n",
"    index=[\"text\", \"text_ner\", \"text1\", \"text_ner1\", \"text2\", \"text_ner2\"],\n",
"    columns=counts_max_vectorizer.get_feature_names_out(),\n",
")\n",
|
|||
|
"counts_max_df"
|
|||
|
]
|
|||
|
},
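The `ValueError` in the cell above follows from how document-frequency pruning works: an integer `max_df` is an absolute number of documents, a float is a share of the corpus, and a term is dropped when it occurs in more documents than the limit. The corpus repeats each of the two texts three times, so every term occurs in at least 3 documents and `max_df=2` removes all of them. The sketch below reproduces that rule in plain Python on toy token lists; `prune_by_df` and the toy documents are hypothetical illustrations, not part of scikit-learn, and the helper returns an empty list where `CountVectorizer` raises an error.

```python
from collections import Counter


def prune_by_df(docs, min_df=1, max_df=1.0):
    """Hypothetical helper: keep terms whose document frequency is within [min_df, max_df].

    Floats are treated as a share of the corpus, integers as absolute document counts.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    return sorted(term for term, count in df.items() if low <= count <= high)


# Toy stand-ins for `text` and `text_ner` (assumed data, for illustration only)
doc_a = ["new", "exhibition", "science"]
doc_b = ["letters", "of", "thanks", "science"]
corpus = [doc_a, doc_b] * 3  # six documents, each text repeated three times

print(prune_by_df(corpus, max_df=2))    # every term has df >= 3, so nothing is left
print(prune_by_df(corpus, max_df=3))    # terms unique to one text (df == 3) survive
print(prune_by_df(corpus, max_df=0.5))  # 0.5 * 6 documents gives the same limit of 3
```

With this corpus, raising the limit to `max_df=3` (or `max_df=0.5`) would keep the terms that are specific to one of the two texts and drop only the terms shared by both.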
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"n-grams"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>xx</th>\n",
|
|||
|
" <th>xx начала</th>\n",
|
|||
|
" <th>xxi</th>\n",
|
|||
|
" <th>xxi веков</th>\n",
|
|||
|
" <th>авиации</th>\n",
|
|||
|
" <th>авиации бугаева</th>\n",
|
|||
|
" <th>авиации имени</th>\n",
|
|||
|
" <th>аграрного</th>\n",
|
|||
|
" <th>аграрного университета</th>\n",
|
|||
|
" <th>активное</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>февраля</th>\n",
|
|||
|
" <th>февраля ульяновском</th>\n",
|
|||
|
" <th>чьи</th>\n",
|
|||
|
" <th>чьи достижения</th>\n",
|
|||
|
" <th>эксплуатации</th>\n",
|
|||
|
" <th>эксплуатации безопасности</th>\n",
|
|||
|
" <th>экспозиции</th>\n",
|
|||
|
" <th>экспозиции за</th>\n",
|
|||
|
" <th>экспозицию</th>\n",
|
|||
|
" <th>экспозицию они</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text_ner</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>2 rows × 181 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" xx xx начала xxi xxi веков авиации авиации бугаева \\\n",
|
|||
|
"text 1 1 1 1 0 0 \n",
|
|||
|
"text_ner 0 0 0 0 2 1 \n",
|
|||
|
"\n",
|
|||
|
" авиации имени аграрного аграрного университета активное ... \\\n",
|
|||
|
"text 0 0 0 0 ... \n",
|
|||
|
"text_ner 1 1 1 1 ... \n",
|
|||
|
"\n",
|
|||
|
" февраля февраля ульяновском чьи чьи достижения эксплуатации \\\n",
|
|||
|
"text 1 1 1 1 0 \n",
|
|||
|
"text_ner 0 0 0 0 1 \n",
|
|||
|
"\n",
|
|||
|
" эксплуатации безопасности экспозиции экспозиции за экспозицию \\\n",
|
|||
|
"text 0 0 0 1 \n",
|
|||
|
"text_ner 1 1 1 0 \n",
|
|||
|
"\n",
|
|||
|
" экспозицию они \n",
|
|||
|
"text 1 \n",
|
|||
|
"text_ner 0 \n",
|
|||
|
"\n",
|
|||
|
"[2 rows x 181 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 24,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"counts_ng_vectorizer = CountVectorizer(ngram_range=(1, 2))\n",
|
|||
|
"counts_ng_matrix = sparse.csr_matrix(\n",
"    counts_ng_vectorizer.fit_transform([text, text_ner])\n",
")\n",
"counts_ng_df = pd.DataFrame(\n",
"    counts_ng_matrix.toarray(),\n",
"    index=[\"text\", \"text_ner\"],\n",
"    columns=counts_ng_vectorizer.get_feature_names_out(),\n",
")\n",
|
|||
|
"counts_ng_df"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"Limiting the number of terms"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>xx</th>\n",
|
|||
|
" <th>авиации</th>\n",
|
|||
|
" <th>государственного</th>\n",
|
|||
|
" <th>имени</th>\n",
|
|||
|
" <th>науки</th>\n",
|
|||
|
" <th>пресс</th>\n",
|
|||
|
" <th>проректора</th>\n",
|
|||
|
" <th>профессору</th>\n",
|
|||
|
" <th>работе</th>\n",
|
|||
|
" <th>рамках</th>\n",
|
|||
|
" <th>российской</th>\n",
|
|||
|
" <th>руководителю</th>\n",
|
|||
|
" <th>ульяновского</th>\n",
|
|||
|
" <th>ульяновской</th>\n",
|
|||
|
" <th>университета</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text_ner</th>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>2</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>4</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>3</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" xx авиации государственного имени науки пресс проректора \\\n",
|
|||
|
"text 1 0 0 0 2 0 0 \n",
|
|||
|
"text_ner 0 2 3 3 0 1 1 \n",
|
|||
|
"\n",
|
|||
|
" профессору работе рамках российской руководителю ульяновского \\\n",
|
|||
|
"text 0 0 0 1 0 0 \n",
|
|||
|
"text_ner 1 1 1 0 1 4 \n",
|
|||
|
"\n",
|
|||
|
" ульяновской университета \n",
|
|||
|
"text 2 0 \n",
|
|||
|
"text_ner 0 3 "
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 25,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"counts_mf_vectorizer = CountVectorizer(max_features=15)\n",
|
|||
|
"counts_mf_matrix = sparse.csr_matrix(\n",
"    counts_mf_vectorizer.fit_transform([text, text_ner])\n",
")\n",
"counts_mf_df = pd.DataFrame(\n",
"    counts_mf_matrix.toarray(),\n",
"    index=[\"text\", \"text_ner\"],\n",
"    columns=counts_mf_vectorizer.get_feature_names_out(),\n",
")\n",
|
|||
|
"counts_mf_df"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### Frequency profile (TF-IDF)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"$tfidf(t, d) = tf(t, d) \\cdot idf(t, D)$, \\\n",
"where $tf(t, d)$ is the frequency of term $t$ in document $d$; \\\n",
"$idf(t, D)$ is the inverse document frequency (a measure of how informative term $t$ is within the whole document collection $D$).\n",
"\n",
"$tf(t, d) = \\frac { f_{td} } { \\sum_{t' \\in d} f_{t'd} }$, \\\n",
"where $f_{td}$ is the number of occurrences of term $t$ in document $d$; \\\n",
"$\\sum_{t' \\in d} f_{t'd}$ is the total number of term occurrences in document $d$, i.e. the sum of $f_{t'd}$ over all terms $t'$ of document $d$.\n",
"\n",
"$idf(t, D) = \\log \\left( \\frac { N + 1 } { | \\{ d : d \\in D, t \\in d \\} | + 1 } \\right)$, \\\n",
"where $N$ is the total number of documents; \\\n",
"$| \\{ d : d \\in D, t \\in d \\} |$ is the number of documents that contain term $t$.\n",
"\n",
"With `TfidfVectorizer`, $tf(t, d) = f_{td}$ (the raw count, as in BoW). By default it also adds 1 to $idf(t, D)$, so that terms occurring in every document are not zeroed out, and L2-normalizes each document vector.\n",
"\n",
"With ``sublinear_tf=True``, $tf(t, d) = 1 + \\log ( f_{td} )$, which reduces the influence of high-frequency terms on $tfidf(t, d)$."
|
|||
|
]
|
|||
|
},
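The formulas above can be checked by hand on a toy corpus. The sketch below is a plain-Python illustration and not the scikit-learn implementation: it uses the raw count as $tf$, the smoothed $idf$ with the extra `+ 1`, optional sublinear scaling, and L2 normalization of each document vector, which is how `TfidfVectorizer` is documented to behave by default. The function name `tfidf` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf(docs, sublinear_tf=False):
    """Hypothetical helper: docs is a list of token lists.

    Returns one {term: weight} dict per document, L2-normalized.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf from the formula above, plus 1 so that ubiquitous terms keep a non-zero weight
    idf = {term: math.log((n_docs + 1) / (count + 1)) + 1 for term, count in df.items()}
    rows = []
    for doc in docs:
        weights = {}
        for term, f_td in Counter(doc).items():
            tf = 1 + math.log(f_td) if sublinear_tf else f_td
            weights[term] = tf * idf[term]
        # L2 normalization: scale the vector to unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = [["a", "b", "b"], ["a", "c"]]  # toy corpus (assumed data)
plain = tfidf(docs)
damped = tfidf(docs, sublinear_tf=True)

for name, rows in (("raw tf", plain), ("sublinear tf", damped)):
    print(name, [{t: round(w, 3) for t, w in row.items()} for row in rows])
```

In the toy corpus the term `a` occurs in both documents, so its weight is lower than that of the document-specific terms `b` and `c`; switching on `sublinear_tf` lowers the weight of the repeated term `b` relative to the raw-count variant.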
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>xx</th>\n",
|
|||
|
" <th>xxi</th>\n",
|
|||
|
" <th>авиации</th>\n",
|
|||
|
" <th>аграрного</th>\n",
|
|||
|
" <th>активное</th>\n",
|
|||
|
" <th>безопасности</th>\n",
|
|||
|
" <th>благодарственные</th>\n",
|
|||
|
" <th>богатовой</th>\n",
|
|||
|
" <th>бугаева</th>\n",
|
|||
|
" <th>были</th>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <th>университета</th>\n",
|
|||
|
" <th>университете</th>\n",
|
|||
|
" <th>управления</th>\n",
|
|||
|
" <th>участие</th>\n",
|
|||
|
" <th>ученым</th>\n",
|
|||
|
" <th>февраля</th>\n",
|
|||
|
" <th>чьи</th>\n",
|
|||
|
" <th>эксплуатации</th>\n",
|
|||
|
" <th>экспозиции</th>\n",
|
|||
|
" <th>экспозицию</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <td>0.167287</td>\n",
|
|||
|
" <td>0.167287</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.167287</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.167287</td>\n",
|
|||
|
" <td>0.167287</td>\n",
|
|||
|
" <td>0.167287</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.167287</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>text_ner</th>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.199854</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>0.247713</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.118037</td>\n",
|
|||
|
" <td>0.000000</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>2 rows × 87 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" xx xxi авиации аграрного активное безопасности \\\n",
|
|||
|
"text 0.167287 0.167287 0.000000 0.000000 0.000000 0.000000 \n",
|
|||
|
"text_ner 0.000000 0.000000 0.199854 0.118037 0.118037 0.118037 \n",
|
|||
|
"\n",
|
|||
|
" благодарственные богатовой бугаева были ... университета \\\n",
|
|||
|
"text 0.000000 0.000000 0.000000 0.000000 ... 0.000000 \n",
|
|||
|
"text_ner 0.118037 0.118037 0.118037 0.118037 ... 0.247713 \n",
|
|||
|
"\n",
|
|||
|
" университете управления участие ученым февраля чьи \\\n",
|
|||
|
"text 0.167287 0.000000 0.000000 0.167287 0.167287 0.167287 \n",
|
|||
|
"text_ner 0.000000 0.118037 0.118037 0.000000 0.000000 0.000000 \n",
|
|||
|
"\n",
|
|||
|
" эксплуатации экспозиции экспозицию \n",
|
|||
|
"text 0.000000 0.000000 0.167287 \n",
|
|||
|
"text_ner 0.118037 0.118037 0.000000 \n",
|
|||
|
"\n",
|
|||
|
"[2 rows x 87 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 26,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
|||
|
"\n",
|
|||
|
"tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True)\n",
|
|||
|
"tfidf_matrix = sparse.csr_matrix(tfidf_vectorizer.fit_transform([text, text_ner]))\n",
|
|||
|
"tfidf_df = pd.DataFrame(\n",
"    tfidf_matrix.toarray(),\n",
"    index=[\"text\", \"text_ner\"],\n",
"    columns=tfidf_vectorizer.get_feature_names_out(),\n",
")\n",
|
|||
|
"tfidf_df"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### Embeddings"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 27,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"['Накануне 0.0', 'Дня 0.0', 'российской 6.628802299499512', 'науки 6.097537994384766', ', 0.0', '7 0.0', 'февраля 5.738062381744385', ', 0.0', 'в 5.39321756362915', 'Ульяновском 0.0', 'государственном 6.0524067878723145', 'техническом 5.820169925689697', 'университете 6.496365070343018', 'открыли 5.944686412811279', 'новую 5.83096170425415', 'экспозицию 5.348505020141602', '« 0.0', 'Они 0.0', 'стояли 5.839313507080078', 'у 5.27208137512207', 'истоков 5.396897315979004', 'ульяновской 6.163735866546631', 'науки 6.097537994384766', '» 0.0', '. 0.0', 'Она 0.0', 'посвящена 6.200736999511719', 'выдающимся 5.983486175537109', 'ученым 5.918715000152588', 'XX 0.0', '– 0.0', 'начала 5.412599086761475', 'XXI 0.0', 'веков 5.634378910064697', ', 0.0', 'чьи 5.3263936042785645', 'достижения 5.957095623016357', 'и 3.6682209968566895', 'изобретения 5.965053081512451', 'сформировали 5.379479885101318', 'научный 5.801889419555664', 'и 3.6682209968566895', 'технологический 5.863436222076416', 'облик 5.85204553604126', 'Ульяновской 0.0', 'области 6.532813549041748', '. 0.0']\n",
|
|||
|
"['В 0.0', 'рамках 5.911857604980469', 'торжественного 5.679202556610107', 'открытия 6.117740154266357', 'экспозиции 5.659939765930176', 'за 5.40718412399292', 'активное 5.857132911682129', 'участие 6.153576374053955', 'в 5.39321756362915', 'подготовке 6.464332103729248', 'материалов 6.255794048309326', 'были 6.02486515045166', 'вручены 5.769386291503906', 'благодарственные 6.248109340667725', 'письма 6.082029342651367', 'профессору 5.6763691902160645', 'кафедры 6.288632392883301', 'летной 6.138020992279053', 'эксплуатации 6.849708557128906', 'и 3.6682209968566895', 'безопасности 6.110499382019043', 'полетов 6.38402795791626', 'Ульяновского 0.0', 'института 6.3007354736328125', 'гражданской 6.207894802093506', 'авиации 6.574370861053467', 'имени 5.757457256317139', 'Главного 0.0', 'маршала 6.116937160491943', 'авиации 6.574370861053467', 'Б.П.Бугаева 0.0', 'Сергею 0.0', 'Косачевскому 0.0', ', 0.0', 'начальнику 6.195803165435791', 'управления 6.50614070892334', 'научно 6.042668342590332', '- 0.0', 'исследовательской 5.603431224822998', 'и 3.6682209968566895', 'инновационной 6.336301803588867', 'деятельности 6.13490104675293', 'Ульяновского 0.0', 'государственного 6.192478656768799', 'педагогического 6.430647373199463', 'университета 6.620550632476807', 'имени 5.757457256317139', 'И.Н.Ульянова 0.0', 'Светлане 0.0', 'Богатовой 0.0', ', 0.0', 'руководителю 5.767575740814209', 'пресс 5.993161201477051', '- 0.0', 'службы 6.049290657043457', 'Ульяновского 0.0', 'государственного 6.192478656768799', 'аграрного 6.049352169036865', 'университета 6.620550632476807', 'имени 5.757457256317139', 'П.А.Столыпина 0.0', 'Винере 0.0', 'Насыровой 0.0', ', 0.0', 'помощнику 5.360434532165527', 'проректора 5.608104228973389', 'по 5.825357437133789', 'научной 5.987777233123779', 'работе 5.538445949554443', 'Ульяновского 0.0', 'государственного 6.192478656768799', 'университета 6.620550632476807', 'Татьяне 0.0', 'Лисовой 0.0', '. 0.0']\n",
|
|||
|
"0.7208023892475828\n",
|
|||
|
"0.2978442641202898\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print([f\"{token.text} {token.vector_norm}\" for token in sp(text)])\n",
|
|||
|
"print([f\"{token.text} {token.vector_norm}\" for token in sp(text_ner)])\n",
|
|||
|
"\n",
|
|||
|
"print(sp(text).similarity(sp(text_ner)))\n",
|
|||
|
"print(sp(\"Мама мыла раму\").similarity(sp(\"Биологический родитель выполнял очистку каркаса окна\")))"
|
|||
|
]
|
|||
|
},
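In the output above, a norm of `0.0` means the token has no vector in the model's vocabulary. As far as the spaCy documentation describes it, `Doc.similarity` defaults to the cosine similarity of the document vectors, and a document vector defaults to the average of its token vectors. The sketch below illustrates that computation in plain Python on made-up three-dimensional vectors; `mean_vector`, `cosine`, and the numbers are assumptions for illustration and do not come from `ru_core_news_lg`.

```python
import math


def mean_vector(vectors):
    """Average a list of equally sized vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "token vectors"; the all-zero row stands for a token without a vector
doc_1 = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
doc_2 = [[2.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

similarity = cosine(mean_vector(doc_1), mean_vector(doc_2))
print(round(similarity, 4))
```

Because the measure depends only on the direction of the averaged vectors, two texts on similar topics can score high even when they share few words, which matches the gap between the two similarity values printed by the cell above.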
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# Text analysis example"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### Loading data from documents"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"The `type` column makes it possible to use supervised learning methods to build a classifier"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 28,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|||
|
"Index: 41 entries, 0 to 40\n",
|
|||
|
"Data columns (total 3 columns):\n",
|
|||
|
" # Column Non-Null Count Dtype \n",
|
|||
|
"--- ------ -------------- ----- \n",
|
|||
|
" 0 doc 41 non-null object\n",
|
|||
|
" 1 text 41 non-null object\n",
|
|||
|
" 2 type 41 non-null int64 \n",
|
|||
|
"dtypes: int64(1), object(2)\n",
|
|||
|
"memory usage: 1.3+ KB\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>doc</th>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <th>type</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16</th>\n",
|
|||
|
" <td>tz_01.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения\\...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19</th>\n",
|
|||
|
" <td>tz_02.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения\\...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>28</th>\n",
|
|||
|
" <td>tz_03.docx</td>\n",
|
|||
|
" <td>2.2. Техническое задание\\nОбщие сведения:\\nВ д...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>35</th>\n",
|
|||
|
" <td>tz_04.docx</td>\n",
|
|||
|
" <td>Техническое задание\\n2.2.1 Общие сведения\\nИнт...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>38</th>\n",
|
|||
|
" <td>tz_05.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения....</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" doc text type\n",
|
|||
|
"16 tz_01.docx 2.2 Техническое задание\\n2.2.1 Общие сведения\\... 0\n",
|
|||
|
"19 tz_02.docx 2.2 Техническое задание\\n2.2.1 Общие сведения\\... 0\n",
|
|||
|
"28 tz_03.docx 2.2. Техническое задание\\nОбщие сведения:\\nВ д... 0\n",
|
|||
|
"35 tz_04.docx Техническое задание\\n2.2.1 Общие сведения\\nИнт... 0\n",
|
|||
|
"38 tz_05.docx 2.2 Техническое задание\\n2.2.1 Общие сведения.... 0"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>doc</th>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <th>type</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>25</th>\n",
|
|||
|
" <td>Этапы разработки проекта2.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: заключительные стади...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>21</th>\n",
|
|||
|
" <td>Этапы разработки проекта3.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: определение стратеги...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>40</th>\n",
|
|||
|
" <td>Этапы разработки проекта4.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: реализация, тестиров...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>30</th>\n",
|
|||
|
" <td>Этапы разработки проекта5.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: стратегия и анализ\\n...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>22</th>\n",
|
|||
|
" <td>Язык манипуляции данными.docx</td>\n",
|
|||
|
" <td>2.1.3. Язык манипуляции данными (ЯМД)\\nЯзык ма...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" doc \\\n",
|
|||
|
"25 Этапы разработки проекта2.docx \n",
|
|||
|
"21 Этапы разработки проекта3.docx \n",
|
|||
|
"40 Этапы разработки проекта4.docx \n",
|
|||
|
"30 Этапы разработки проекта5.docx \n",
|
|||
|
"22 Язык манипуляции данными.docx \n",
|
|||
|
"\n",
|
|||
|
" text type \n",
|
|||
|
"25 Этапы разработки проекта: заключительные стади... 1 \n",
|
|||
|
"21 Этапы разработки проекта: определение стратеги... 1 \n",
|
|||
|
"40 Этапы разработки проекта: реализация, тестиров... 1 \n",
|
|||
|
"30 Этапы разработки проекта: стратегия и анализ\\n... 1 \n",
|
|||
|
"22 2.1.3. Язык манипуляции данными (ЯМД)\\nЯзык ма... 1 "
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"from docx import Document\n",
|
|||
|
"import os\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"def read_docx(file_path):\n",
"    # Join the text of all paragraphs of a .docx file with newlines\n",
"    doc = Document(file_path)\n",
"    full_text = []\n",
"    for paragraph in doc.paragraphs:\n",
"        full_text.append(paragraph.text)\n",
"    return \"\\n\".join(full_text)\n",
"\n",
"\n",
"def load_docs(dataset_path):\n",
"    # Build a DataFrame with one row per document: file name and extracted text\n",
"    df = pd.DataFrame(columns=[\"doc\", \"text\"])\n",
"    for file_path in os.listdir(dataset_path):\n",
"        # Skip the temporary lock files that Word creates for open documents\n",
"        if file_path.startswith(\"~$\"):\n",
"            continue\n",
"        text = read_docx(os.path.join(dataset_path, file_path))\n",
"        df.loc[len(df.index)] = [file_path, text]\n",
"    return df\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"df = load_docs(\"data/text/\")\n",
|
|||
|
"df[\"type\"] = df.apply(\n",
"    lambda row: 0 if str(row[\"doc\"]).startswith(\"tz_\") else 1, axis=1\n",
")\n",
|
|||
|
"df.info()\n",
|
|||
|
"df.sort_values(by=[\"doc\"], inplace=True)\n",
|
|||
|
"\n",
|
|||
|
"display(df.head(), df.tail())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"#### Vectorizing documents as a bag of words"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 29,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>doc</th>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <th>type</th>\n",
|
|||
|
" <th>vector</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16</th>\n",
|
|||
|
" <td>tz_01.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения\\...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19</th>\n",
|
|||
|
" <td>tz_02.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения\\...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>28</th>\n",
|
|||
|
" <td>tz_03.docx</td>\n",
|
|||
|
" <td>2.2. Техническое задание\\nОбщие сведения:\\nВ д...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>35</th>\n",
|
|||
|
" <td>tz_04.docx</td>\n",
|
|||
|
" <td>Техническое задание\\n2.2.1 Общие сведения\\nИнт...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>38</th>\n",
|
|||
|
" <td>tz_05.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения....</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" doc text type \\\n",
|
|||
|
"16 tz_01.docx 2.2 Техническое задание\\n2.2.1 Общие сведения\\... 0 \n",
|
|||
|
"19 tz_02.docx 2.2 Техническое задание\\n2.2.1 Общие сведения\\... 0 \n",
|
|||
|
"28 tz_03.docx 2.2. Техническое задание\\nОбщие сведения:\\nВ д... 0 \n",
|
|||
|
"35 tz_04.docx Техническое задание\\n2.2.1 Общие сведения\\nИнт... 0 \n",
|
|||
|
"38 tz_05.docx 2.2 Техническое задание\\n2.2.1 Общие сведения.... 0 \n",
|
|||
|
"\n",
|
|||
|
" vector \n",
|
|||
|
"16 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"19 [0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"28 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"35 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"38 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... "
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>doc</th>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <th>type</th>\n",
|
|||
|
" <th>vector</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>25</th>\n",
|
|||
|
" <td>Этапы разработки проекта2.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: заключительные стади...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>21</th>\n",
|
|||
|
" <td>Этапы разработки проекта3.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: определение стратеги...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>40</th>\n",
|
|||
|
" <td>Этапы разработки проекта4.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: реализация, тестиров...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>30</th>\n",
|
|||
|
" <td>Этапы разработки проекта5.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: стратегия и анализ\\n...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>22</th>\n",
|
|||
|
" <td>Язык манипуляции данными.docx</td>\n",
|
|||
|
" <td>2.1.3. Язык манипуляции данными (ЯМД)\\nЯзык ма...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" doc \\\n",
|
|||
|
"25 Этапы разработки проекта2.docx \n",
|
|||
|
"21 Этапы разработки проекта3.docx \n",
|
|||
|
"40 Этапы разработки проекта4.docx \n",
|
|||
|
"30 Этапы разработки проекта5.docx \n",
|
|||
|
"22 Язык манипуляции данными.docx \n",
|
|||
|
"\n",
|
|||
|
" text type \\\n",
|
|||
|
"25 Этапы разработки проекта: заключительные стади... 1 \n",
|
|||
|
"21 Этапы разработки проекта: определение стратеги... 1 \n",
|
|||
|
"40 Этапы разработки проекта: реализация, тестиров... 1 \n",
|
|||
|
"30 Этапы разработки проекта: стратегия и анализ\\n... 1 \n",
|
|||
|
"22 2.1.3. Язык манипуляции данными (ЯМД)\\nЯзык ма... 1 \n",
|
|||
|
"\n",
|
|||
|
" vector \n",
|
|||
|
"25 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"21 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"40 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"30 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"22 [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... "
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"counts_vectorizer = CountVectorizer()\n",
|
|||
|
"counts_matrix = sparse.csr_matrix(counts_vectorizer.fit_transform(df[\"text\"]))\n",
|
|||
|
"words = counts_vectorizer.get_feature_names_out()\n",
"df[\"vector\"] = list(counts_matrix.toarray())  # rows follow df's positional order, not its index labels\n",
|
|||
|
"\n",
|
|||
|
"display(df.head(), df.tail())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Listing the terms whose frequency exceeds threshold for each document"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 30,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>doc</th>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <th>type</th>\n",
|
|||
|
" <th>vector</th>\n",
|
|||
|
" <th>top_terms</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>16</th>\n",
|
|||
|
" <td>tz_01.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения\\...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>информации, требования</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>19</th>\n",
|
|||
|
" <td>tz_02.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения\\...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>анализа, на, по, предприятия, работ, системы, ...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>28</th>\n",
|
|||
|
" <td>tz_03.docx</td>\n",
|
|||
|
" <td>2.2. Техническое задание\\nОбщие сведения:\\nВ д...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>данных, для, записей, знаний, из, или, ко, мно...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>35</th>\n",
|
|||
|
" <td>tz_04.docx</td>\n",
|
|||
|
" <td>Техническое задание\\n2.2.1 Общие сведения\\nИнт...</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>дата, для, заказ, заказа, информации, код, мат...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>38</th>\n",
|
|||
|
" <td>tz_05.docx</td>\n",
|
|||
|
" <td>2.2 Техническое задание\\n2.2.1 Общие сведения....</td>\n",
|
|||
|
" <td>0</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>apache, cgi, html, iis, java, jdbc, linux, net...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" doc text type \\\n",
|
|||
|
"16 tz_01.docx 2.2 Техническое задание\\n2.2.1 Общие сведения\\... 0 \n",
|
|||
|
"19 tz_02.docx 2.2 Техническое задание\\n2.2.1 Общие сведения\\... 0 \n",
|
|||
|
"28 tz_03.docx 2.2. Техническое задание\\nОбщие сведения:\\nВ д... 0 \n",
|
|||
|
"35 tz_04.docx Техническое задание\\n2.2.1 Общие сведения\\nИнт... 0 \n",
|
|||
|
"38 tz_05.docx 2.2 Техническое задание\\n2.2.1 Общие сведения.... 0 \n",
|
|||
|
"\n",
|
|||
|
" vector \\\n",
|
|||
|
"16 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"19 [0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"28 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"35 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"38 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"\n",
|
|||
|
" top_terms \n",
|
|||
|
"16 информации, требования \n",
|
|||
|
"19 анализа, на, по, предприятия, работ, системы, ... \n",
|
|||
|
"28 данных, для, записей, знаний, из, или, ко, мно... \n",
|
|||
|
"35 дата, для, заказ, заказа, информации, код, мат... \n",
|
|||
|
"38 apache, cgi, html, iis, java, jdbc, linux, net... "
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>doc</th>\n",
|
|||
|
" <th>text</th>\n",
|
|||
|
" <th>type</th>\n",
|
|||
|
" <th>vector</th>\n",
|
|||
|
" <th>top_terms</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>25</th>\n",
|
|||
|
" <td>Этапы разработки проекта2.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: заключительные стади...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>2010, на, по, программы</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>21</th>\n",
|
|||
|
" <td>Этапы разработки проекта3.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: определение стратеги...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>данных, для, на, ошибки, по, работ, системы, т...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>40</th>\n",
|
|||
|
" <td>Этапы разработки проекта4.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: реализация, тестиров...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>create, doctors, sql, бд, данных, для, на, соз...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>30</th>\n",
|
|||
|
" <td>Этапы разработки проекта5.docx</td>\n",
|
|||
|
" <td>Этапы разработки проекта: стратегия и анализ\\n...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>быть, должна, система, системы</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>22</th>\n",
|
|||
|
" <td>Язык манипуляции данными.docx</td>\n",
|
|||
|
" <td>2.1.3. Язык манипуляции данными (ЯМД)\\nЯзык ма...</td>\n",
|
|||
|
" <td>1</td>\n",
|
|||
|
" <td>[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
|
|||
|
" <td>быть, данных, для, должен, должна, должны, инф...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" doc \\\n",
|
|||
|
"25 Этапы разработки проекта2.docx \n",
|
|||
|
"21 Этапы разработки проекта3.docx \n",
|
|||
|
"40 Этапы разработки проекта4.docx \n",
|
|||
|
"30 Этапы разработки проекта5.docx \n",
|
|||
|
"22 Язык манипуляции данными.docx \n",
|
|||
|
"\n",
|
|||
|
" text type \\\n",
|
|||
|
"25 Этапы разработки проекта: заключительные стади... 1 \n",
|
|||
|
"21 Этапы разработки проекта: определение стратеги... 1 \n",
|
|||
|
"40 Этапы разработки проекта: реализация, тестиров... 1 \n",
|
|||
|
"30 Этапы разработки проекта: стратегия и анализ\\n... 1 \n",
|
|||
|
"22 2.1.3. Язык манипуляции данными (ЯМД)\\nЯзык ма... 1 \n",
|
|||
|
"\n",
|
|||
|
" vector \\\n",
|
|||
|
"25 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"21 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"40 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"30 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"22 [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
|
|||
|
"\n",
|
|||
|
" top_terms \n",
|
|||
|
"25 2010, на, по, программы \n",
|
|||
|
"21 данных, для, на, ошибки, по, работ, системы, т... \n",
|
|||
|
"40 create, doctors, sql, бд, данных, для, на, соз... \n",
|
|||
|
"30 быть, должна, система, системы \n",
|
|||
|
"22 быть, данных, для, должен, должна, должны, инф... "
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
"def get_terms(doc_name, words, threshold=0):\n",
"    # doc_name is an index label (row.name), so look the row up with .loc rather than .iloc\n",
"    important_words = [\n",
"        word\n",
"        for word, score in zip(words, df.loc[doc_name, \"vector\"])\n",
"        if score > threshold\n",
"    ]\n",
"    return \", \".join(important_words)\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"df[\"top_terms\"] = df.apply(lambda row: get_terms(row.name, words, threshold=7), axis=1)\n",
|
|||
|
"\n",
|
|||
|
"display(df.head(), df.tail())"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
"#### Hard non-hierarchical clustering of documents"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 31,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Кластер 1 (30):\n",
|
|||
|
"tz_01.docx; tz_02.docx; tz_03.docx; tz_06.docx; tz_07.docx; tz_08.docx; tz_10.docx; tz_11.docx; tz_14.docx; tz_16.docx; tz_17.docx; tz_20.docx; Архитектура, управляемая модель.docx; Введение в проектирование ИС.docx; Встроенные операторы SQL.docx; Методологии разработки программного обеспечения 2.docx; Методологии разработки программного обеспечения.docx; Методы композиции и декомпозиции.docx; Модели представления данных в СУБД.docx; Некоторые особенности проектирования.docx; Непроцедурный доступ к данным.docx; Процедурное расширение языка SQL.docx; Системные объекты базы данных.docx; Технология создания распр ИС.docx; Требования к проекту.docx; Условия целостности БД.docx; Характеристики СУБД.docx; Этапы разработки проекта1.docx; Этапы разработки проекта5.docx; Язык манипуляции данными.docx\n",
|
|||
|
"--------\n",
|
|||
|
"Кластер 2 (11):\n",
|
|||
|
"tz_04.docx; tz_05.docx; tz_09.docx; tz_12.docx; tz_13.docx; tz_15.docx; tz_18.docx; tz_19.docx; Этапы разработки проекта2.docx; Этапы разработки проекта3.docx; Этапы разработки проекта4.docx\n",
|
|||
|
"--------\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.cluster import KMeans\n",
|
|||
|
"import numpy\n",
|
|||
|
"\n",
|
|||
|
"num_clusters = 2\n",
|
|||
|
"kmeans = KMeans(n_clusters=num_clusters, random_state=9)\n",
|
|||
|
"kmeans.fit(sparse.csr_matrix(list(df[\"vector\"])))\n",
|
|||
|
"\n",
"for cluster_id in range(num_clusters):\n",
"    cluster_indices = numpy.where(kmeans.labels_ == cluster_id)[0]\n",
"    print(f\"Кластер {cluster_id + 1} ({len(cluster_indices)}):\")\n",
"    cluster_docs = [df.iloc[idx][\"doc\"] for idx in cluster_indices]\n",
"    print(\"; \".join(cluster_docs))\n",
"    print(\"--------\")"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": ".venv",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.12.9"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 2
|
|||
|
}
|