第22章 扩展示例:Web客户端编程 至此,您已经看到了如何与数据库交互,解析一些数据,以及处理错误。现在让我们更进了一步,引入Web客户端库的组合。
在本章,我们将开发一个真正的应用程序:一个播客下载软件,或播客采集软件。一个播客采集软件的想法很简单。给出一系列的URL。下载每个URL并将结果存到RSS格式的XML文件中。在这个XML文件里,我们会找到用于下载音频文件的URL。
播客采集软件通常是让用户通过往他们的配置里添加RSS URL来订阅播客。然后,用户可以定期执行更新操作。播客采集软件将下载RSS文件,检查它们的音频文件引用,并下载该用户之前没有下载过的音频文件。 %用户经常访问播客的RSS文档,并在播放清单中包括每个独立的音频文件。 要做到这一点,我们需要有这几件东西:
•一个用于下载文件的HTTP客户端库
•一个XML解析器
•持久存储我们感兴趣的播客的方法
•持久存储哪些播客节目我们已经下载的方法
最后两个项目可以通过数据库实现,我们将使用HDBC。前两个可以通过本章将会介绍的其他库模块实现。 %本章中的代码是专门为这本书所写,但是是基于为hpodder所写的代码,hpodder是一个用haskell所写的播客采集软件。 hpodder有比在这里展示的例子更多的特性,这使得它太长太复杂难以在这本书中展现。如果您有兴趣学习hpodder,它的源代码是免费提供的,http://software.complete.org/hpodder。 我们将为本章件编写的代码分成几块。每块都会是它自己的Haskell模块。您将能够在ghci中使用它的每一部分。最后,我们会将所有代码整合成完整的应用程序。我们将从将会使用的基本类型开始。 基本类型 要做的第一件事是要对应用程序的重要基本信息有一些概念。这通常会是关于用户感兴趣的播客以及我们已经看过的播放清单的有关信息。如有需要这是很容易修变的,但由于我们将会在所有地方import它,我们将首先定义它: -- file: ch22/PodTypes.hs
module PodTypes where
data Podcast =
Podcast {castId :: Integer, -- ^ Numeric ID for this podcast
castURL :: String -- ^ Its feed URL
}
deriving (Eq, Show, Read) data Episode =
Episode {epId :: Integer, -- ^ Numeric ID for this episode
epCast :: Podcast, -- ^ The ID of the podcast it came from
epURL :: String, -- ^ The download URL for this episode
epDone :: Bool -- ^ Whether or not we are done with this ep
}
deriving (Eq, Show, Read) 我们将会在数据库中存储这些信息。对于播客和播放清单来说有一个唯一的标识符将可以很容易地发现哪个播放清单属于哪个特定的播客,更容易地为特定的播客或清单载入信息以及处理如改变播客的URL地址的事情。 数据库 接下来,我们将编写代码以使在数据库中持久存储成为可能。我们主要的兴趣集中在如何在我们定义在PodTypes.hs中的haskell数据结构和磁盘上的数据库之间转移数据。此外,当用户第一次运行该程序时,用户将需要创建数据库表用来存储我们的数据。
我们将使用HDBS(见第21章)与SQLite数据库交互。SQLite是轻巧的并且是独立的,这使得它非常适合这个项目。有关安装HDBC和SQLite,查阅“安装HDBC和驱动程序”第494页: -- file: ch22/PodDB.hs
module PodDB where import Database.HDBC
import Database.HDBC.Sqlite3
import PodTypes
import Control.Monad(when)
import Data.List(sort) -- | Initialize DB and return database Connection
connect :: FilePath -> IO Connection
connect fp =
do dbh <- connectSqlite3 fp
prepDB dbh
return dbh
{- | Prepare the database for our data. We create two tables and ask the database engine to verify some pieces
of data consistency for us: * castid and epid both are unique primary keys and must never be duplicated
* castURL also is unique
* In the episodes table, for a given podcast (epcast), there must be only
one instance of each given URL or episode ID
-}
prepDB :: IConnection conn => conn -> IO ()
prepDB dbh =
do tables <- getTables dbh
when (not ("podcasts" `elem` tables)) $
do run dbh "CREATE TABLE podcasts (\
\castid INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,\
\castURL TEXT NOT NULL UNIQUE)" []
return ()
when (not ("episodes" `elem` tables)) $
do run dbh "CREATE TABLE episodes (\
\epid INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,\
\epcastid INTEGER NOT NULL,\
\epurl TEXT NOT NULL,\
\epdone INTEGER NOT NULL,\
\UNIQUE(epcastid, epurl),\
\UNIQUE(epcastid, epid))" []
return ()
commit dbh {- | Adds a new podcast to the database. Ignores the castid on the
incoming podcast, and returns a new object with the castid populated.
An attempt to add a podcast that already exists is an error. -}
addPodcast :: IConnection conn => conn -> Podcast -> IO Podcast
addPodcast dbh podcast =
handleSql errorHandler $
do -- Insert the castURL into the table. The database
-- will automatically assign a cast ID.
run dbh "INSERT INTO podcasts (castURL) VALUES (?)"
[toSql (castURL podcast)]
-- Find out the castID for the URL we just added.
r <- quickQuery' dbh "SELECT castid FROM podcasts WHERE castURL = ?"
[toSql (castURL podcast)]
case r of
[[x]] -> return $ podcast {castId = fromSql x}
y -> fail $ "addPodcast: unexpected result: " ++ show y
where errorHandler e =
do fail $ "Error adding podcast; does this URL already exist?\n"
++ show e {- | Adds a new episode to the database. Since this is done by automation instead of by user request, we will
simply ignore requests to add duplicate episodes. This way, when we are
processing a feed, each URL encountered can be fed to this function,
without having to first look it up in the DB. Also, we generally won't care about the new ID here, so don't bother
fetching it. -}
addEpisode :: IConnection conn => conn -> Episode -> IO ()
addEpisode dbh ep =
run dbh "INSERT OR IGNORE INTO episodes (epCastId, epURL, epDone) \
\VALUES (?, ?, ?)"
[toSql (castId . epCast $ ep), toSql (epURL ep),
toSql (epDone ep)]
>> return () {- | Modifies an existing podcast. Looks up the given podcast by
ID and modifies the database record to match the passed Podcast. -}
updatePodcast :: IConnection conn => conn -> Podcast -> IO ()
updatePodcast dbh podcast =
run dbh "UPDATE podcasts SET castURL = ? WHERE castId = ?"
[toSql (castURL podcast), toSql (castId podcast)]
>> return () {- | Modifies an existing episode. Looks it up by ID and modifies the
database record to match the given episode. -}
updateEpisode :: IConnection conn => conn -> Episode -> IO ()
updateEpisode dbh episode =
run dbh "UPDATE episodes SET epCastId = ?, epURL = ?, epDone = ? \
\WHERE epId = ?"
[toSql (castId . epCast $ episode),
toSql (epURL episode),
toSql (epDone episode),
toSql (epId episode)]
>> return () {- | Remove a podcast. First removes any episodes that may exist
for this podcast. -}
removePodcast :: IConnection conn => conn -> Podcast -> IO ()
removePodcast dbh podcast =
do run dbh "DELETE FROM episodes WHERE epcastid = ?"
[toSql (castId podcast)]
run dbh "DELETE FROM podcasts WHERE castid = ?"
[toSql (castId podcast)]
return () {- | Gets a list of all podcasts. -}
getPodcasts :: IConnection conn => conn -> IO [Podcast]
getPodcasts dbh =
do res <- quickQuery' dbh
"SELECT castid, casturl FROM podcasts ORDER BY castid" []
return (map convPodcastRow res) {- | Get a particular podcast. Nothing if the ID doesn't match, or
Just Podcast if it does. -}
getPodcast :: IConnection conn => conn -> Integer -> IO (Maybe Podcast)
getPodcast dbh wantedId =
do res <- quickQuery' dbh
"SELECT castid, casturl FROM podcasts WHERE castid = ?"
[toSql wantedId]
case res of
[x] -> return (Just (convPodcastRow x))
[] -> return Nothing
x -> fail $ "Really bad error; more than one podcast with ID" {- | Convert the result of a SELECT into a Podcast record -}
convPodcastRow :: [SqlValue] -> Podcast
convPodcastRow [svId, svURL] =
Podcast {castId = fromSql svId,
castURL = fromSql svURL}
convPodcastRow x = error $ "Can't convert podcast row " ++ show x {- | Get all episodes for a particular podcast. -}
getPodcastEpisodes :: IConnection conn => conn -> Podcast -> IO [Episode]
getPodcastEpisodes dbh pc =
do r <- quickQuery' dbh
"SELECT epId, epURL, epDone FROM episodes WHERE epCastId = ?"
[toSql (castId pc)]
return (map convEpisodeRow r)
where convEpisodeRow [svId, svURL, svDone] =
Episode {epId = fromSql svId, epURL = fromSql svURL,
epDone = fromSql svDone, epCast = pc} 在PodDB模块中,我们定义了连接数据库,创建所需表,添加,查询,删除数据的函数。这面是一个在ghci中演示的与数据库进行交互的例子。它将在当前的工作目录创建一个名为poddbtest.db的数据库文件,并添加一个播客和一个播放清单: ghci> :load PodDB.hs
[1 of 2] Compiling PodTypes ( PodTypes.hs, interpreted )
[2 of 2] Compiling PodDB ( PodDB.hs, interpreted )
Ok, modules loaded: PodDB, PodTypes.
ghci> dbh <- connect "poddbtest.db"
ghci> :type dbh
dbh :: Connection
ghci> getTables dbh
["episodes","podcasts","sqlite_sequence"]
ghci> let url = "http://feeds.thisamericanlife.org/talpodcast"
ghci> pc <- addPodcast dbh (Podcast {castId=0, castURL=url})
Podcast {castId = 1, castURL = "http://feeds.thisamericanlife.org/talpodcast"}
ghci> getPodcasts dbh
[Podcast {castId = 1, castURL = "http://feeds.thisamericanlife.org/talpodcast"}]
ghci> addEpisode dbh (Episode {epId = 0, epCast = pc, epURL =
"http://www.example.com/foo.mp3", epDone = False})
ghci> getPodcastEpisodes dbh pc
[Episode {epId = 1, epCast = Podcast {castId = 1, castURL =
"http://feeds.thisamericanlife.org/talpodcast"}, epURL = "http://www.example.com/foo.mp3",
epDone = False}]
ghci> commit dbh
ghci> disconnect dbh 解析器 现在,我们有数据库组件,我们需要有代码来解析播客feeds。这些XML文件包含丰富的信息。下面是一个XML文件的示例: <?xml version="1.0" encoding="utf-8"?>
<rss xmlns:itunes="http://www.itunes.com/DTDs/Podcast-1.0.dtd" version="2.0">
<channel>
<title>Haskell Radio</title>
<link>http://www.example.com/radio/</link>
<description>Description of this podcast</description>
<item>
<title>Episode 2: Lambdas</title>
<link>http://www.example.com/radio/lambdas</link>
<enclosure url="http://www.example.com/radio/lambdas.mp3" type="audio/mpeg" length="10485760"/>
</item>
<item>
<title>Episode 1: Parsec</title>
<link>http://www.example.com/radio/parsec</link>
<enclosure url="http://www.example.com/radio/parsec.mp3" type="audio/mpeg" length="10485150"/>
</item>
</channel>
</rss> 在这些文件之外,我们的主要感兴趣两件事情:播客标题和附带的网址。我们使用了HaXml工具箱(http://www.cs.york.ac.uk/fp/HaXml/)来解析XML文件。下面是这个组件的源代码: -- file: ch22/PodParser.hs
module PodParser where import PodTypes
import Text.XML.HaXml
import Text.XML.HaXml.Parse
import Text.XML.HaXml.Html.Generate(showattr)
import Data.Char
import Data.List data PodItem = PodItem {itemtitle :: String,
enclosureurl :: String
}
deriving (Eq, Show, Read) data Feed = Feed {channeltitle :: String,
items :: [PodItem]}
deriving (Eq, Show, Read) {- | Given a podcast and an PodItem, produce an Episode -}
item2ep :: Podcast -> PodItem -> Episode
item2ep pc item =
Episode {epId = 0,
epCast = pc,
epURL = enclosureurl item,
epDone = False} {- | Parse the data from a given string, with the given name to use
in error messages. -}
parse :: String -> String -> Feed
parse content name =
Feed {channeltitle = getTitle doc,
items = getEnclosures doc} where parseResult = xmlParse name (stripUnicodeBOM content)
doc = getContent parseResult getContent :: Document -> Content
getContent (Document _ _ e _) = CElem e {- | Some Unicode documents begin with a binary sequence;
strip it off before processing. -}
stripUnicodeBOM :: String -> String
stripUnicodeBOM ('\xef':'\xbb':'\xbf':x) = x
stripUnicodeBOM x = x {- | Pull out the channel part of the document. Note that HaXml defines CFilter as: > type CFilter = Content -> [Content]
-}
channel :: CFilter
channel = tag "rss" /> tag "channel" getTitle :: Content -> String
getTitle doc =
contentToStringDefault "Untitled Podcast"
(channel /> tag "title" /> txt $ doc) getEnclosures :: Content -> [PodItem]
getEnclosures doc =
concatMap procPodItem $ getPodItems doc
where procPodItem :: Content -> [PodItem]
procPodItem item = concatMap (procEnclosure title) enclosure
where title = contentToStringDefault "Untitled Episode"
(keep /> tag "title" /> txt $ item)
enclosure = (keep /> tag "enclosure") item getPodItems :: CFilter
getPodItems = channel /> tag "item" procEnclosure :: String -> Content -> [PodItem]
procEnclosure title enclosure =
map makePodItem (showattr "url" enclosure)
where makePodItem :: Content -> PodItem
makePodItem x = PodItem {itemtitle = title,
enclosureurl = contentToString [x]} {- | Convert [Content] to a printable String, with a default if the
passed-in [Content] is [], signifying a lack of a match. -}
contentToStringDefault :: String -> [Content] -> String
contentToStringDefault msg [] = msg
contentToStringDefault _ x = contentToString x {- | Convert [Content] to a printable string, taking care to unescape it. An implementation without unescaping would simply be: > contentToString = concatMap (show . content) Because HaXml's unescaping works only on Elements, we must make sure that
whatever Content we have is wrapped in an Element, then use txt to
pull the insides back out. -}
contentToString :: [Content] -> String
contentToString =
concatMap procContent
where procContent x =
verbatim $ keep /> txt $ CElem (unesc (fakeElem x)) fakeElem :: Content -> Element
fakeElem x = Elem "fake" [] [x] unesc :: Element -> Element
unesc = xmlUnEscape stdXmlEscaper 让我们来看看这段代码。首先,我们声明了两个类型:PodItem和Feed。我们将把XML文档转换成一个Feed,其中包含的相应的条目。我们还在PodTypes.hs中提供一个将PodItem默认转换成一个Episode的函数。
接下来,是时候解析了。 parse函数需要一个代表XML内容的字符串和一个代表错误消息名的字符串,然后返回一个Feed。
HaXml被设计成一个将一种类型转换成另一种类型的“过滤器”。它简单直接的将XML转换为XML,或者将XML转换为Haskell数据,或者将Haskell数据转换为XML。HaXml的拥有一种叫做CFILTER的数据类型,它的定义是这样的: type CFilter = Content -> [Content] 也就是说,一个CFilter需要一个XML文档的一些片段作为参数,并返回0个或多个片段。 CFilter可能被要求找到指定标签的所有孩子,所有标签带有一个确定的名称,这些文本包含了一个XML文档的一部分,或者任何的其他的一些东西。也有运算符(/>)将CFilter函数链接起来。所有我们感兴趣的数据均在<channel>标签当中,所以首先我们要理解,我们定义了一个简单的CFilter: channel = tag "rss" /> tag "channel" 当我们传递一个文档到一个频道的时候,它会搜索名为rss的标签的顶层。然后,在此范围内,它会寻找channel标签。
其余的程序遵循这个基本的方法。txt通过使用CFilter函数从标签中提取文本,我们可以得到文档的任何一部分。 下载 我们程序的下一部分是一个下载数据的模块。我们需要下载两种不同类型的数据:每个播放清单的播客的内容和音频。在前一种情况下,我们将解析数据,并更新我们的数据库。对于后者,我们会将数据写出到磁盘上的文件。
我们会从HTTP服务器下载,所以我们将使用一个Haskell HTTP库(http://www.haskell.org/http/)。为了下载播客feeds,我们将下载文档,解析它,并更新数据库。对于播放清单中的音频,我们将下载该文件,写入到磁盘上,并在数据库中标出它已下载。下面是代码: -- file: ch22/PodDownload.hs
module PodDownload where
import PodTypes
import PodDB
import PodParser
import Network.HTTP
import System.IO
import Database.HDBC
import Data.Maybe
import Network.URI {- | Download a URL. (Left errorMessage) if an error,
(Right doc) if success. -}
downloadURL :: String -> IO (Either String String)
downloadURL url =
do resp <- simpleHTTP request
case resp of
Left x -> return $ Left ("Error connecting: " ++ show x)
Right r ->
case rspCode r of
(2,_,_) -> return $ Right (rspBody r)
(3,_,_) -> -- A HTTP redirect
case findHeader HdrLocation r of
Nothing -> return $ Left (show r)
Just url -> downloadURL url
_ -> return $ Left (show r)
where request = Request {rqURI = uri,
rqMethod = GET,
rqHeaders = [],
rqBody = ""}
uri = fromJust $ parseURI url {- | Update the podcast in the database. -}
updatePodcastFromFeed :: IConnection conn => conn -> Podcast -> IO ()
updatePodcastFromFeed dbh pc =
do resp <- downloadURL (castURL pc)
case resp of
Left x -> putStrLn x
Right doc -> updateDB doc where updateDB doc =
do mapM_ (addEpisode dbh) episodes
commit dbh
where feed = parse doc (castURL pc)
episodes = map (item2ep pc) (items feed) {- | Downloads an episode, returning a String representing
the filename it was placed into, or Nothing on error. -}
getEpisode :: IConnection conn => conn -> Episode -> IO (Maybe String)
getEpisode dbh ep =
do resp <- downloadURL (epURL ep)
case resp of
Left x -> do putStrLn x
return Nothing
Right doc ->
do file <- openBinaryFile filename WriteMode
hPutStr file doc
hClose file
updateEpisode dbh (ep {epDone = True})
commit dbh
return (Just filename)
-- This function ought to apply an extension based on the file type
where filename = "pod." ++ (show . castId . epCast $ ep) ++ "." ++
(show (epId ep)) ++ ".mp3" 这个模块定义了三个函数:downloadURL,简单地下载URL然后将它以String返回;updatePodcastFromFeed,下载一个XML feed文件,解析它,并更新数据库; getEpisode,下载一个给定的播放清单并在数据库中记录。 %这里使用的HTTP库不是lazy读取HTTP的结果。因此,当下载大文件时,它可能会导致大量的RAM的消耗。其他库没有这个限制。我们用这一个,因为它是稳定的,易于安装,使用起来相当容易。我们建议mini-http,可从Hackage得到,适用于较苛刻的HTTP需求。 主程序 最后,我们需要一个主程序把所有这些结合在一起。这里是我们的主要模块: -- file: ch22/PodMain.hs
module Main where
import PodDownload
import PodDB
import PodTypes
import System.Environment
import Database.HDBC
import Network.Socket(withSocketsDo) main = withSocketsDo $ handleSqlError $
do args <- getArgs
dbh <- connect "pod.db"
case args of
["add", url] -> add dbh url
["update"] -> update dbh
["download"] -> download dbh
["fetch"] -> do update dbh
download dbh
_ -> syntaxError
disconnect dbh add dbh url =
do addPodcast dbh pc
commit dbh
where pc = Podcast {castId = 0, castURL = url} update dbh =
do pclist <- getPodcasts dbh
mapM_ procPodcast pclist
where procPodcast pc =
do putStrLn $ "Updating from " ++ (castURL pc)
updatePodcastFromFeed dbh pc download dbh =
do pclist <- getPodcasts dbh
mapM_ procPodcast pclist
where procPodcast pc =
do putStrLn $ "Considering " ++ (castURL pc)
episodelist <- getPodcastEpisodes dbh pc
let dleps = filter (\ep -> epDone ep == False)
episodelist
mapM_ procEpisode dleps
procEpisode ep =
do putStrLn $ "Downloading " ++ (epURL ep)
getEpisode dbh ep syntaxError = putStrLn
"Usage: pod command [args]\n\
\\n\
\pod add url Adds a new podcast with the given URL\n\
\pod download Downloads all pending episodes\n\
\pod fetch Updates, then downloads\n\
\pod update Downloads podcast feeds, looks for new episodes\n" 我们有一个非常简单的命令行解析器,它的一个功能是可以指出一个命令行语法错误,还有些小功能是来处理不同的命令行参数。
你可以编译这个程序用类似下面的命令: ghc --make -O2 -o pod -package HTTP -package HaXml -package network \
-package HDBC -package HDBC-sqlite3 PodMain.hs 另外,你可以使用一个Cabal文件,关于它的文档在第131页"Creating a Package": -- ch23/pod.cabal
Name: pod
Version: 1.0.0
Build-type: Simple
Build-Depends: HTTP, HaXml, network, HDBC, HDBC-sqlite3, base
Executable: pod
Main-Is: PodMain.hs
GHC-Options: -O2 并且,你需要一个Setup.hs文件: import Distribution.Simple
main = defaultMain 现在,用Cabal来构建,运行下面的命令即可: runghc Setup.hs configure
runghc Setup.hs build 然后你会发现一个dist的目录中包含你的输出。在系统中安装程序,运行runghc Setup.hs install。