11.29.2016

Parsing Yahoo Movies with Rx

Scraping web page data with C#

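I recently had a small task that required scraping web page data with C#. I looked for an HTML parser first. I originally planned to use NSoup, but after comparing the options I settled on HtmlAgilityPack.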

First, I created the model:

public class MovieInfo
{
    public Uri ImageUri { get; set; }
    public string ChineseName { get; set; }
    public string EnglishName { get; set; }
    public DateTime ReleaseDateTime { get; set; }
    public string BriefDescription { get; set; }
    public string FullyDescription { get; set; }

    public override string ToString()
    {
        return string.Format("Movie [{0}], [{1}], [{2}], [{3}], [{4}]", 
            ChineseName, 
            EnglishName, 
            ReleaseDateTime,
            BriefDescription, 
            ImageUri);
    }
}

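Next comes the parser. (I originally planned to use HtmlAgilityPack together with ScrapySharp, but the examples showed that simple parsing works fine without it, so I did not learn ScrapySharp separately. That said, XPath is worth taking some time to understand properly.)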

First, look at the HTML structure used on Yahoo Kimo Movies:

<div class="item">
    <div class="img">
        <a href="https://tw.rd.yahoo.com/referurl/movie/thisweek/info/*https://tw.movies.yahoo.com/movieinfo_main.html/id=6344">
            <img src="https://s.yimg.com/vu/movies/fp/mpost4/63/44/6344.jpg" title="死亡筆記本:決戰新世界">
        </a>
    </div>
    <div class="text">
        <h4>
            <a href="https://tw.rd.yahoo.com/referurl/movie/thisweek/info/*https://tw.movies.yahoo.com/movieinfo_main.html/id=6344">死亡筆記本:決戰新世界</a>
        </h4>
        <h5>
            <a href="https://tw.rd.yahoo.com/referurl/movie/thisweek/info/*https://tw.movies.yahoo.com/movieinfo_main.html/id=6344">Death
                Note Light up the NEW world</a></h5>
        <span class="date">上映日期:<span>2016-11-25</span></span>
        <p>
            ★ 史上最經典鬥智推理代表作《死亡筆記本》,電影版十年後全新篇章再起! ★ 《寄生獸》東出昌大 X 《紙之月》池松壯亮 X 《暗殺教室》
            <ins>...<a href="movieinfo_main.html/id=6344" hpp="thisweek-guide">詳全文</a></ins>
        </p>
        <div class="clearfix">
            <ul class="links clearfix">
                <li class="intro"><a
                        href="https://tw.rd.yahoo.com/referurl/movie/thisweek/info/*https://tw.movies.yahoo.com/movieinfo_main.html/id=6344">電影介紹</a>
                </li>
                <li class="trailer"><a
                        href="https://tw.rd.yahoo.com/referurl/movie/thisweek/trailer/*https://tw.movies.yahoo.com/video/死亡筆記本-決戰新世界-中文版預告-015209257.html">預告片</a>
                </li>
                <li class="photo"><a
                        href="https://tw.rd.yahoo.com/referurl/movie/thisweek/photo/*https://tw.movies.yahoo.com/movieinfo_photos.html/id=6344">劇照</a>
                </li>
                <li class="time"><a
                        href="https://tw.rd.yahoo.com/referurl/movie/thisweek/time/*https://tw.movies.yahoo.com/movietime_result.html/id=6344">時刻表</a>
                </li>
            </ul>
        </div>
    </div>
</div>

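Every movie sits inside a div class="item" element, so the parsing code starts from that node. In the code below, nodes holds the information for every movie on the current page. While parsing, I also tried out Parallel.ForEach, although I have not yet measured the total time it takes.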

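using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Threading.Tasks;

public static class YahooMovieParser
{
    // Guards movieInfos, which is filled from several threads by Parallel.ForEach
    private static readonly object Sync = new object();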
    public static List<MovieInfo> Parse(string webContent)
    {
        var html = new HtmlAgilityPack.HtmlDocument();
        html.LoadHtml(webContent);

        var root = html.DocumentNode;
        var nodes = root.Descendants()
            .Where(n => n.GetAttributeValue("class", "").Equals("item"));

        var movieInfos = new List<MovieInfo>();

        // Parallel.ForEach does not preserve the source order,
        // so use it only when the order of the results does not matter
        Parallel.ForEach(nodes, node =>
        {
            var mi = new MovieInfo();

            // Read the src attribute by name instead of relying on attribute order
            var divImg = node
                .Descendants().Single(n => n.GetAttributeValue("class", "").Equals("img"))
                .Descendants("a").Single()
                .Descendants("img").Single()
                .GetAttributeValue("src", "");

            var divText = node.Descendants()
                .Single(n => n.GetAttributeValue("class", "").Equals("text"));

            var cname = divText
                .Descendants("h4").Single()
                .Descendants("a").Single()
                .InnerText;

            var ename = divText
                .Descendants("h5").Single()
                .Descendants("a").Single()
                .InnerText;

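            // The release date is the inner <span> nested inside <span class="date">
            var rDate = divText
                .Descendants("span").First(n => n.GetAttributeValue("class", "").Equals("date"))
                .Descendants("span").Single()
                .InnerText.Trim();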

            var briefDescription = divText
                .Descendants("p").Single()
                .FirstChild
                .InnerText;

            mi.ImageUri = new Uri(divImg);
            mi.ChineseName = cname;
            mi.EnglishName = ename;
            mi.ReleaseDateTime = DateTime.ParseExact(rDate, "yyyy-MM-dd", CultureInfo.InvariantCulture);
            mi.BriefDescription = briefDescription;

            lock (Sync)
            {
                movieInfos.Add(mi);
            }
        });
        return movieInfos;
    }
}
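
Since XPath came up earlier, here is a minimal sketch of how the same div class="item" lookup might be written with HtmlAgilityPack's SelectNodes instead of Descendants().Where(...). The class name XPathSketch and the method ChineseTitles are hypothetical names used only for illustration, and the sample HTML is a trimmed copy of the structure shown above. It only extracts the Chinese title, not the full MovieInfo.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class XPathSketch
{
    // Returns the Chinese title of every movie found in the given HTML.
    public static string[] ChineseTitles(string webContent)
    {
        var html = new HtmlDocument();
        html.LoadHtml(webContent);

        // By default SelectNodes returns null (not an empty collection)
        // when nothing matches, so check before enumerating.
        var links = html.DocumentNode.SelectNodes("//div[@class='item']//h4/a");
        if (links == null)
        {
            return new string[0];
        }

        return links
            .Select(a => HtmlEntity.DeEntitize(a.InnerText).Trim())
            .ToArray();
    }

    public static void Main()
    {
        // Trimmed sample of the markup shown earlier (assumption: same structure)
        const string sample =
            "<div class=\"item\"><div class=\"text\">" +
            "<h4><a href=\"#\">死亡筆記本:決戰新世界</a></h4>" +
            "<span class=\"date\">上映日期:<span>2016-11-25</span></span>" +
            "</div></div>";

        foreach (var title in ChineseTitles(sample))
        {
            Console.WriteLine(title);
        }
    }
}
```

The XPath expression matches only elements whose class attribute is exactly "item", which is the same condition as the Equals("item") check in the parser above.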

None of that is the main point, though. The point is to try doing this with Rx, which I am currently learning, as follows:

private IObservable<List<MovieInfo>> ParseMovie(string url)
{
    var wc = new WebClient() { Encoding = Encoding.UTF8 };

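    // When the DownloadStringCompleted event fires, parse the data and
    // hand the result to the subscriber. Take(1) completes the sequence
    // after the single download so the event handler is detached.
    IObservable<List<MovieInfo>> observable = Observable
        .FromEventPattern<DownloadStringCompletedEventArgs>(wc, "DownloadStringCompleted")
        .Take(1)
        .Select(item =>
        {
            // Result throws if the download failed or was cancelled
            var data = item.EventArgs.Result;
            return YahooMovieParser.Parse(data);
        });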

    wc.DownloadStringAsync(new Uri(url));

    return observable;
}
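
One thing to be aware of in the version above is that the download starts before anyone has subscribed, so it relies on the subscriber attaching before the completed event fires. A possible alternative, sketched here under assumptions, is to wrap a Task-returning download with Observable.FromAsync, which starts the work only on subscription and completes after one value. The names FromAsyncSketch, FetchAndParse, and FromUrl are hypothetical. The Main method uses a stub download so it runs without network access.

```csharp
using System;
using System.Net;
using System.Reactive.Linq;
using System.Text;
using System.Threading.Tasks;

public static class FromAsyncSketch
{
    // Starts the download when someone subscribes, parses the text,
    // emits a single result, then completes.
    public static IObservable<TResult> FetchAndParse<TResult>(
        Func<Task<string>> download, Func<string, TResult> parse)
    {
        return Observable.FromAsync(download).Select(parse);
    }

    // How a real URL might be plugged in (not called in Main below).
    public static IObservable<TResult> FromUrl<TResult>(
        string url, Func<string, TResult> parse)
    {
        return FetchAndParse(async () =>
        {
            using (var wc = new WebClient { Encoding = Encoding.UTF8 })
            {
                return await wc.DownloadStringTaskAsync(new Uri(url));
            }
        }, parse);
    }

    public static void Main()
    {
        // Stub download and parser, used only for illustration
        IObservable<int> lengths = FetchAndParse(
            () => Task.FromResult("<html/>"),
            text => text.Length);

        // Wait blocks until the sequence completes and returns its last value
        Console.WriteLine(lengths.Wait());
    }
}
```

In the original code, the parse argument would be YahooMovieParser.Parse, giving an IObservable<List<MovieInfo>> like ParseMovie returns.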

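As an aside, once I learned about var I used it wherever I could. Later I came across differing opinions, and after some thought I decided to use var only where the type is obvious. For example, with the Observable.FromEventPattern call above, which is followed by Select, the actual type cannot be determined at a glance (you have to hover the mouse so the IDE shows it).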

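The observable in this code replaces the callbacks we would otherwise use for an asynchronous call. It specifies that the result is fetched and parsed once the download completes, and makes it available to later subscribers (such as the subscription held in iCanBeDisposed in the code below).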

private void Form1_Load(object sender, EventArgs e)
{
    // Parse the movie pages every 10 seconds and output the result when finished.
    IDisposable iCanBeDisposed = Observable.Interval(TimeSpan.FromSeconds(10))
        .ObserveOn(SynchronizationContext.Current)
        .Subscribe(count =>
        {
            ParseMovie(URL_THIS_WEEK)
                .Subscribe(movieInfos =>
                {
                    listBox1.Items.Clear();
                    listBox1.Items.Add("本週新片");
                    listBox1.Items.Add("Count: " + count);

                    foreach (var item in movieInfos)
                    {
                        listBox1.Items.Add(item.ToString());
                    }
                });

            ParseMovie(URL_IN_THEATERS)
                .Subscribe(movieInfos =>
                {
                    listBox2.Items.Clear();
                    listBox2.Items.Add("上映中");
                    listBox2.Items.Add("Count: " + count);

                    foreach (var item in movieInfos)
                    {
                        listBox2.Items.Add(item.ToString());
                    }
                });
        },
        ex => Trace.WriteLine(ex));
}
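
The nested Subscribe calls above work, but only the outer subscription has an error handler, and nothing ever disposes iCanBeDisposed. As a rough sketch of one way this could be flattened, SelectMany can turn each timer tick into the fetches for that tick, leaving a single subscription to handle and dispose. PollingSketch, FakeFetch, and Poll are hypothetical names, and FakeFetch is only a stand-in for ParseMovie so the example runs on its own.

```csharp
using System;
using System.Reactive.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class PollingSketch
{
    // Hypothetical stand-in for ParseMovie(url): emits one value, then completes.
    private static IObservable<string> FakeFetch(string listName, long tick)
    {
        return Observable.FromAsync(() => Task.FromResult(listName + " #" + tick));
    }

    // On every tick, fetch both lists and merge the results into one stream.
    public static IObservable<string> Poll(TimeSpan period)
    {
        return Observable.Interval(period)
            .SelectMany(tick => FakeFetch("this-week", tick)
                .Merge(FakeFetch("in-theaters", tick)));
    }

    public static void Main()
    {
        // One subscription, one error handler, one IDisposable to clean up
        IDisposable subscription = Poll(TimeSpan.FromMilliseconds(50))
            .Subscribe(
                line => Console.WriteLine(line),
                ex => Console.Error.WriteLine(ex));

        Thread.Sleep(300);

        // Disposing the subscription stops the timer
        subscription.Dispose();
    }
}
```

In the form itself, ObserveOn(SynchronizationContext.Current) would still go before Subscribe so the list boxes are updated on the UI thread, and the subscription could be disposed when the form closes.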

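Movies that are now showing are split across multiple pages, so next I will try fetching the remaining pages asynchronously, along with each movie's detailed information.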
