//Parse IMDb page data
private static void ParseIMDbPage(string imdbUrl, bool GetExtraInfo, ImdbMovie mov)
{
    string html = GetUrlData(imdbUrl + "combined");
    mov.Id = match(@"<link rel=""canonical"" href=""http://www.imdb.com/title/(tt\d{7})/combined"" />", html);
    if (!string.IsNullOrEmpty(mov.Id))
    {
        mov.Status = true;
        mov.Title = match(@"<title>(IMDb \- )*(.*?) \(.*?</title>", html, 2);
        mov.OriginalTitle = match(@"title-extra"">(.*?)<", html);
        mov.Year = match(@"<title>.*?\(.*?(\d{4}).*?\).*?</title>", match(@"(<title>.*?</title>)", html));
        //The decimal point is escaped so that only ratings such as "8.1" match
        mov.Rating = match(@"<b>(\d\.\d)/10</b>", html);
        mov.Genres = MatchAll(@"<a.*?>(.*?)</a>", match(@"Genre.?:(.*?)(</div>|See more)", html)).Cast<string>().ToList();
        mov.Plot = match(@"Plot:</h5>.*?<div class=""info-content"">(.*?)(<a|</div)", html);
        //mov.Directors = MatchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Directed by</a></h5>(.*?)</table>", html));
        //mov.Writers = MatchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Writing credits</a></h5>(.*?)</table>", html));
        //mov.Producers = MatchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Produced by</a></h5>(.*?)</table>", html));
        //mov.Musicians = MatchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Original Music by</a></h5>(.*?)</table>", html));
        //mov.Cinematographers = MatchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Cinematography by</a></h5>(.*?)</table>", html));
        //mov.Editors = MatchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Film Editing by</a></h5>(.*?)</table>", html));
        //mov.Cast = MatchAll(@"<td class=""nm""><a.*?href=""/name/.*?/"".*?>(.*?)</a>", match(@"<h3>Cast</h3>(.*?)</table>", html));
        //mov.PlotKeywords = MatchAll(@"<a.*?>(.*?)</a>", match(@"Plot Keywords:</h5>.*?<div class=""info-content"">(.*?)</div", html));
        //mov.ReleaseDate = match(@"Release Date:</h5>.*?<div class=""info-content"">.*?(\d{1,2} (January|February|March|April|May|June|July|August|September|October|November|December) (19|20)\d{2})", html);
        //mov.Runtime = match(@"Runtime:</h5><div class=""info-content"">(\d{1,4}) min[\s]*.*?</div>", html);
        //mov.Top250 = match(@"Top 250: #(\d{1,3})<", html);
        //mov.Oscars = match(@"Won (\d+) Oscars?\.", html);
        //if (string.IsNullOrEmpty(mov.Oscars) && "Won Oscar.".Equals(match(@"(Won Oscar\.)", html))) mov.Oscars = "1";
        //mov.Awards = match(@"(\d{1,4}) wins", html);
        //mov.Nominations = match(@"(\d{1,4}) nominations", html);
        //mov.Tagline = match(@"Tagline:</h5>.*?<div class=""info-content"">(.*?)(<a|</div)", html);
        //mov.MpaaRating = match(@"MPAA</a>:</h5><div class=""info-content"">Rated (G|PG|PG-13|PG-14|R|NC-17|X) ", html);
        //mov.Votes = match(@">(\d+,?\d*) votes<", html);
        //mov.Languages = MatchAll(@"<a.*?>(.*?)</a>", match(@"Language.?:(.*?)(</div>|>.?and )", html));
        //mov.Countries = MatchAll(@"<a.*?>(.*?)</a>", match(@"Country:(.*?)(</div>|>.?and )", html));
        mov.Poster = match(@"<div class=""photo"">.*?<a name=""poster"".*?><img.*?src=""(.*?)"".*?</div>", html);
        if (!string.IsNullOrEmpty(mov.Poster) && mov.Poster.IndexOf("media-imdb.com") > 0)
        {
            //Rewrite the size suffix to request a 200px-high poster
            mov.Poster = Regex.Replace(mov.Poster, @"_V1.*?\.jpg", "_V1._SY200.jpg");
            //mov.PosterLarge = Regex.Replace(mov.Poster, @"_V1.*?\.jpg", "_V1._SY500.jpg");
            //mov.PosterFull = Regex.Replace(mov.Poster, @"_V1.*?\.jpg", "_V1._SY0.jpg");
        }
        else
        {
            mov.Poster = string.Empty;
            //mov.PosterLarge = string.Empty;
            //mov.PosterFull = string.Empty;
        }
        mov.ImdbURL = "http://www.imdb.com/title/" + mov.Id + "/";
        if (GetExtraInfo)
        {
            //Only needed by the commented-out Storyline extraction below
            string plotHtml = GetUrlData(imdbUrl + "plotsummary");
            //mov.Storyline = match(@"<p class=""plotpar"">(.*?)(<i>|</p>)", plotHtml);
            GetReleaseDatesAndAka(mov);
            //mov.MediaImages = GetMediaImages(mov);
            //mov.RecommendedTitles = GetRecommendedTitles(mov);
        }
    }
}

//Scrape a movie by its IMDb id
public static ImdbMovie ImdbScrapeFromId(string imdbId, bool GetExtraInfo = true)
{
    ImdbMovie mov = new ImdbMovie();
    string imdbUrl = "http://www.imdb.com/title/" + imdbId + "/";
    mov.Status = false;
    ParseIMDbPage(imdbUrl, GetExtraInfo, mov);
    return mov;
}

//Scrape a movie by name (static factory method, not a constructor)
public static ImdbMovie ImdbScrape(string MovieName, bool GetExtraInfo = true)
{
    ImdbMovie mov = new ImdbMovie();
    string imdbUrl = GetIMDbUrl(System.Uri.EscapeUriString(MovieName));
    mov.Status = false;
    //Status stays false when no IMDb URL could be found for the name
    if (!string.IsNullOrWhiteSpace(imdbUrl))
    {
        ParseIMDbPage(imdbUrl, GetExtraInfo, mov);
    }
    return mov;
}

//Get Recommended Titles
private static ArrayList GetRecommendedTitles(ImdbMovie mov)
{
    string recUrl = "http://www.imdb.com/widget/recommendations/_ajax/get_more_recs?specs=p13nsims%3A" + mov.Id;
    string json = GetUrlData(recUrl);
    ArrayList list = MatchAll(@"title=\\""(.*?)\\""", json);
    HashSet<string> set = new HashSet<string>();
    foreach (string rec in list)
    {
        set.Add(rec);
    }
    return new ArrayList(set.ToList());
}

//Get all media images
private static ArrayList GetMediaImages(ImdbMovie mov)
{
    ArrayList list = new ArrayList();
    string mediaurl = "http://www.imdb.com/title/" + mov.Id + "/mediaindex";
    string mediahtml = GetUrlData(mediaurl);
    //Number of pagination links found; the loop below fetches pagecount + 1 pages
    int pagecount = MatchAll(@"<a href=""\?page=(.*?)"">", match(@"<span style=""padding: 0 1em;"">(.*?)</span>", mediahtml)).Count;
    for (int p = 1; p <= pagecount + 1; p++)
    {
        mediahtml = GetUrlData(mediaurl + "?page=" + p);
        foreach (Match m in new Regex(@"src=""(.*?)""", RegexOptions.Multiline).Matches(match(@"<div class=""thumb_list"" style=""font-size: 0px;"">(.*?)</div>", mediahtml)))
        {
            string image = m.Groups[1].Value;
            //Rewrite the thumbnail size suffix to _SY0 (both dots escaped so they match literally)
            list.Add(Regex.Replace(image, @"_V1\..*?\.jpg", "_V1._SY0.jpg"));
        }
    }
    return list;
}

//Get all release dates and aka-s
private static void GetReleaseDatesAndAka(ImdbMovie mov)
{
    Dictionary<string, string> release = new Dictionary<string, string>();
    string releasehtml = GetUrlData("http://www.imdb.com/title/" + mov.Id + "/releaseinfo");
    foreach (string r in MatchAll(@"<tr class="".*?"">(.*?)</tr>", match(@"<table id=""release_dates"" class=""subpage_data spFirst"">\n*?(.*?)</table>", releasehtml)))
    {
        Match rd = new Regex(@"<td>(.*?)</td>\n*?.*?<td class=.*?>(.*?)</td>", RegexOptions.Multiline).Match(r);
        release[StripHTML(rd.Groups[1].Value.Trim())] = StripHTML(rd.Groups[2].Value.Trim());
    }
    //mov.ReleaseDates = release;

    Dictionary<string, string> aka = new Dictionary<string, string>();
    ArrayList list = MatchAll(@".*?<tr class="".*?"">(.*?)</tr>", match(@"<table id=""akas"" class=.*?>\n*?(.*?)</table>", releasehtml));
    foreach (string r in list)
    {
        Match rd = new Regex(@"\n*?.*?<td>(.*?)</td>\n*?.*?<td>(.*?)</td>", RegexOptions.Multiline).Match(r);
        aka[StripHTML(rd.Groups[1].Value.Trim())] = StripHTML(rd.Groups[2].Value.Trim());
    }
    mov.Aka = aka;
}