
夜流歌 · Updated 2020-10-19


Handling Unicode Data in TensorFlow

In the earlier lessons on data processing, all of our text was ASCII or some other limited encoding. Those encodings cannot represent every character of every language; ASCII, for example, cannot represent Chinese. To handle such text we need a broader character standard, so this lesson covers how to work with Unicode data in TensorFlow.

1. Creating Unicode strings and tensors in TensorFlow

In TensorFlow, Unicode text is stored in the tf.string data type, and by default it is encoded as UTF-8. The following example shows the details:

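import tensorflow as tf

# In Python 3 every str is already Unicode, so the u prefix is optional.
ch_string = u"你好呀！"
en_string = u"Hello"

# tf.constant stores each string as a UTF-8 encoded scalar tensor.
ch_string_utf_8 = tf.constant(ch_string)
en_string_utf_8 = tf.constant(en_string)
print(ch_string_utf_8, en_string_utf_8, sep='\n')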

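This code does two things:

- It prefixes each string literal with u to mark it as Unicode. In Python 3 every str is already Unicode, so the prefix is optional;
- It uses tf.constant to convert each string into a Tensor.

It produces the following output: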

tf.Tensor(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe5\x91\x80\xef\xbc\x81', shape=(), dtype=string)
tf.Tensor(b'Hello', shape=(), dtype=string)

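A few things stand out:

- Both tensors have dtype string, which is TensorFlow's tf.string data type;
- Both tensors have an empty shape, (). Each one is a scalar: TensorFlow treats a string as a single atomic value, so the string's length is not part of the tensor's shape, and strings of different lengths can share one tensor;
- Both strings are stored as UTF-8 bytes, as the b prefix shows. The Chinese string appears as escaped multi-byte sequences, while the English string looks unchanged because ASCII characters encode to the same single bytes in UTF-8.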
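As a cross-check that does not need TensorFlow, the sketch below encodes the same two strings with Python's built-in str.encode. It is only an illustration of the UTF-8 bytes involved, not of how TensorFlow stores them internally.

```python
# Plain-Python cross-check: the bytes shown in the tensor output are the
# UTF-8 encoding of the original Python strings.
ch_string = "你好呀！"
en_string = "Hello"

ch_bytes = ch_string.encode("utf-8")
en_bytes = en_string.encode("utf-8")

print(ch_bytes)  # multi-byte sequences, printed as \x escapes
print(en_bytes)  # ASCII text looks unchanged after encoding

# Decoding the bytes restores the original text.
assert ch_bytes.decode("utf-8") == ch_string
```

The same round trip is available on a scalar string tensor by calling numpy() to get the bytes and then decode("utf-8") on the result.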

2. How Unicode strings are represented in TensorFlow

TensorFlow has two representations for a Unicode string:

- Encoded: a string of bytes produced by an encoding such as UTF-8 or UTF-16;
- Decoded: a sequence of integers, one per character, where each integer uniquely identifies a character. These integers are called "code points".

What we saw in Section 1 was the encoded representation, using UTF-8. We can convert between the two representations with tf.strings.unicode_decode and tf.strings.unicode_encode, as in this example:

ch_string_utf_8_decode = tf.strings.unicode_decode(ch_string_utf_8, input_encoding='UTF-8')
ch_string_utf_8_encode = tf.strings.unicode_encode(ch_string_utf_8_decode, output_encoding='UTF-8')
print(ch_string_utf_8_decode)
print(ch_string_utf_8_encode)

Here tf.strings.unicode_decode takes two arguments:

- The first is the string to decode, in our case ch_string_utf_8;
- The second is the encoding of the input string. Our string is UTF-8, so we pass input_encoding='UTF-8'.

tf.strings.unicode_encode works the same way, except that its second argument is the encoding of the output string. We want UTF-8, so we pass output_encoding='UTF-8'.

The output is:

tf.Tensor([20320 22909 21568 65281], shape=(4,), dtype=int32)
tf.Tensor(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe5\x91\x80\xef\xbc\x81', shape=(), dtype=string)

The decoded string is an array of integers, each one the code point of a single character. More importantly, the decoded array has a shape, (4,) here, which makes the decoded representation better suited for use in a dataset.

We can also see that ch_string_utf_8_encode is identical to ch_string_utf_8, since ch_string_utf_8 was a UTF-8 encoded string to begin with.
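The relationship between the two representations can also be sketched in plain Python, where ord and chr convert between a character and its code point. This mirrors the decode and encode steps conceptually; it is not TensorFlow's implementation.

```python
text = "你好呀！"

# "Decode": one integer code point per character.
code_points = [ord(ch) for ch in text]
print(code_points)

# "Encode": turn the code points back into characters, then into UTF-8 bytes.
restored = "".join(chr(cp) for cp in code_points)
encoded = restored.encode("utf-8")
print(encoded)
```

The code points are independent of any byte encoding; only the final encode step depends on the choice of UTF-8.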

3. Working with a single Unicode string

Whatever its encoding, a Unicode string is still a string, and real applications need the usual string operations on it. This section covers the basic Unicode string operations in TensorFlow.

3.1 Getting the length of a Unicode string

The tf.strings.length function returns the length of a Unicode string. It has two important parameters:

- input, the string to measure;
- unit, the unit of length, either "BYTE" (the default) or "UTF8_CHAR":
  - BYTE counts bytes;
  - UTF8_CHAR counts Unicode characters, which is the length we usually have in mind.

The function returns a Tensor; calling its numpy() method gives a plain number we can use directly.

For example:

len_bytes = tf.strings.length(ch_string_utf_8, unit='BYTE')
len_chars = tf.strings.length(ch_string_utf_8, unit='UTF8_CHAR')
print(len_bytes, len_chars)
print(len_bytes.numpy(), len_chars.numpy())

The output is:

tf.Tensor(12, shape=(), dtype=int32) tf.Tensor(4, shape=(), dtype=int32)
12 4

The string "你好呀！" is 12 bytes long and contains 4 characters: three Chinese characters and one full-width exclamation mark, each taking 3 bytes in UTF-8.
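The difference between the two units can be reproduced in plain Python. The helper name lengths below is made up for this illustration; it returns the byte count and the character count of a string.

```python
def lengths(text):
    """Return (length in UTF-8 bytes, length in characters) for text."""
    return len(text.encode("utf-8")), len(text)

for sample in ["你好呀！", "Hello"]:
    n_bytes, n_chars = lengths(sample)
    print(sample, n_bytes, n_chars)
```

For ASCII text the two counts agree, which is why the distinction is easy to overlook until non-ASCII text appears.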

3.2 Substrings

Substrings are taken with tf.strings.substr, which accepts four arguments:

- input, the Unicode string to slice;
- unit, as above, the unit used for slicing, either "BYTE" or "UTF8_CHAR";
- pos, the zero-based position at which to start;
- len, the number of units to take.

For example:

print(tf.strings.substr(ch_string_utf_8, pos=3, len=1, unit='BYTE'))
print(tf.strings.substr(ch_string_utf_8, pos=3, len=1, unit='UTF8_CHAR'))

The output is:

tf.Tensor(b'\xe5', shape=(), dtype=string)
tf.Tensor(b'\xef\xbc\x81', shape=(), dtype=string)

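b'\xe5' is the single byte at byte offset 3, which is the first byte of "好" and not a complete character on its own. b'\xef\xbc\x81' is the UTF-8 encoding of the final character, "！", at character index 3.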
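A plain-Python sketch of the same two slices shows why the unit matters: slicing by bytes can cut a character in half, leaving bytes that are not valid UTF-8 on their own. This uses Python's own slicing, not tf.strings.substr.

```python
text = "你好呀！"
raw = text.encode("utf-8")

# Slice one unit starting at position 3, first by bytes, then by characters.
byte_slice = raw[3:4]
char_slice = text[3:4].encode("utf-8")
print(byte_slice, char_slice)

# The lone byte is only the first third of a character, so it cannot be decoded.
try:
    byte_slice.decode("utf-8")
    is_valid = True
except UnicodeDecodeError:
    is_valid = False
print(is_valid)
```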

3.3 Splitting a string

Splitting breaks a string into its individual Unicode characters and returns them as an array in which each element holds the encoded bytes of one character.

This is done with tf.strings.unicode_split, used as follows:

print(tf.strings.unicode_split(ch_string_utf_8, 'UTF-8'))

The second argument is the encoding of the input string. The output is:

tf.Tensor([b'\xe4\xbd\xa0' b'\xe5\xa5\xbd' b'\xe5\x91\x80' b'\xef\xbc\x81'], shape=(4,), dtype=string)

The string has been split into its four characters.
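When the split characters need to be aligned with the original byte string, it helps to know where each one starts. The following plain-Python sketch splits the same string and records each character's starting byte offset; the variable names are chosen for this illustration only.

```python
text = "你好呀！"

# One UTF-8 byte string per character.
chars = [ch.encode("utf-8") for ch in text]

# Starting byte offset of each character within the encoded string.
offsets = []
position = 0
for piece in chars:
    offsets.append(position)
    position += len(piece)

print(chars)
print(offsets)
```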

4. Example: building a dataset from Unicode data

In practice, building a dataset from Unicode strings takes roughly three steps:

- Decode the Unicode strings into code points, so that each string's length in characters is known;
- Bring them to a common, fixed length;
- Build the dataset.

Decoding is done with the tf.strings.unicode_decode function from earlier. The following example shows the decoded result:

data_string = [u"你好呀", u"很高兴认识你", u"Hello", u"Nice to meet you"]

decode_data = tf.strings.unicode_decode(data_string, input_encoding='UTF-8')
print(decode_data, decode_data.shape, sep='\n')

The output is:

<tf.RaggedTensor [[20320, 22909, 21568], [24456, 39640, 20852, 35748, 35782, 20320], [72, 101, 108, 108, 111], [78, 105, 99, 101, 32, 116, 111, 32, 109, 101, 101, 116, 32, 121, 111, 117]]>
(4, None)

The result is a tf.RaggedTensor, whose rows do not have to be the same length, which is why the shape can only be (4, None). We can convert it into a fixed-length tensor with its to_tensor() method.

decode_data_pad = decode_data.to_tensor()

print(decode_data_pad, decode_data_pad.shape,  sep='\n')

The result is:

tf.Tensor(
[[20320 22909 21568     0     0     0     0     0     0     0     0     0
      0     0     0     0]
 [24456 39640 20852 35748 35782 20320     0     0     0     0     0     0
      0     0     0     0]
 [   72   101   108   108   111     0     0     0     0     0     0     0
      0     0     0     0]
 [   78   105    99   101    32   116   111    32   109   101   101   116
     32   121   111   117]], shape=(4, 16), dtype=int32)
     
(4, 16)

The data has now been padded with zeros to a common length, which is determined by the longest string. From here we can build the dataset. We will build one from the fixed-length data and one from the variable-length data to see the difference.
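The padding rule itself is simple enough to sketch in plain Python. The helper pad_rows below is hypothetical and written for this illustration; it right-pads each row of code points with zeros up to the length of the longest row, which is the behaviour seen in the output above.

```python
def pad_rows(rows, pad_value=0):
    """Right-pad every row with pad_value up to the longest row's length."""
    width = max(len(row) for row in rows)
    return [row + [pad_value] * (width - len(row)) for row in rows]

# Two short rows of code points, one per string.
rows = [
    [ord(ch) for ch in "你好呀"],
    [ord(ch) for ch in "Hello"],
]

padded = pad_rows(rows)
for row in padded:
    print(row)
```

Code point 0 does not occur in ordinary text, so the padding stays distinguishable from real characters.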

We use dummy labels here, and build the datasets with the usual tf.data.Dataset.from_tensor_slices function:

labels = [0, 0, 0, 0]
dataset = tf.data.Dataset.from_tensor_slices((decode_data, labels))
dataset_pad = tf.data.Dataset.from_tensor_slices((decode_data_pad, labels))
print(dataset)
print(dataset_pad)

The result is:

<TensorSliceDataset shapes: ((None,), ()), types: (tf.int32, tf.int32)>
<TensorSliceDataset shapes: ((16,), ()), types: (tf.int32, tf.int32)>

The dataset built without padding has element shape ((None,), ()), while the padded one has ((16,), ()). A fixed shape is simpler to batch and to feed to layers that expect fixed-size input, so we recommend the padded version.
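Putting the three steps together, a minimal end-to-end sketch might look like the following. It assumes TensorFlow 2.x with eager execution; the sample texts, the labels, and the batch size of 2 are made up for illustration.

```python
import tensorflow as tf

texts = ["你好呀", "Hello", "Nice to meet you"]
labels = [0, 1, 1]  # made-up labels, for illustration only

# Step 1: decode the UTF-8 strings into code points (a RaggedTensor).
code_points = tf.strings.unicode_decode(texts, input_encoding="UTF-8")

# Step 2: pad every row with zeros up to the length of the longest row.
padded = code_points.to_tensor(default_value=0)

# Step 3: build a dataset of (code_points, label) pairs and batch it.
dataset = tf.data.Dataset.from_tensor_slices((padded, labels)).batch(2)

batch_shapes = []
for batch_points, batch_labels in dataset:
    batch_shapes.append(batch_points.shape.as_list())
    print(batch_points.shape, batch_labels.numpy())
```

Because every element has the same length after padding, batch can stack the elements directly.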

5. Summary

In this lesson we learned how to use Unicode strings in TensorFlow: their two representations, the basic string operations, and, through a simple example, how to build a dataset from Unicode strings.
