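// Exercises the regex-based extraction helpers on sample announcement sentences.
public static void RegularExpress()
{
    // Extract every value enclosed in Chinese quotation marks.
    var d0 = "宏润建设集团股份有限公司(以下简称“公司”)于2014年1月7日收到西安市建设工程中标通知书,“西安市地铁四号线工程(航天东路站—北客站)土建施工D4TJSG-5标”项目由公司中标承建,工程中标价49,290万元。";
    var x0 = RegularTool.GetMultiValueBetweenMark(d0, "“", "”");

    // Extract a date written with Chinese numerals.
    // GetDate returns a collection of date strings, so join it for display
    // instead of printing the collection's type name.
    var d1 = DateUtility.GetDate("河北先河环保科技股份有限公司董事会二○一二年十一月三十日");
    Console.WriteLine(String.Join(",", d1));

    // Extract the single value between two keywords.
    var d2 = "公司第五届董事会第七次会议审议通过了《关于公司与神华铁路货车运输有限责任公司签订企业自用货车购置供货合同的议案》,2014年1月20日,公司与神华铁路货车运输有限责任公司签署了《企业自用货车购置供货合同》。";
    var x2 = RegularTool.GetValueBetweenString(d2, "与", "签订");

    // Extract every value between two keywords.
    var s0 = "2010年12月3日,中工国际工程股份有限公司与委内瑞拉农业土地部下属的委内瑞拉农业公司签署了委内瑞拉农副产品加工设备制造厂工业园项目商务合同,与委内瑞拉农签署了委内瑞拉奥里合同。";
    var x = RegularTool.GetMultiValueBetweenString(s0, "与", "签署");

    // Match the text between alternative leading and trailing keywords.
    var s1 = "收到贵州高速公路开发总公司发出的通知";
    var s2 = "接到贵州高速公路开发总公司发出的通知";
    var s3 = "收到贵州高速公路开发总公司发出的告知";
    var s4 = "接到贵州高速公路开发总公司发出的告知";
    var startWords = "收到|接到";
    var endWords = "通知|告知";
    // Lookbehind and lookahead keep the keywords out of the match.
    // With RegexOptions.Singleline, "." also matches newlines, so ".*?" is
    // equivalent to the original "[.\s\S]*?" character class.
    Regex rg = new Regex("(?<=(" + startWords + ")).*?(?=(" + endWords + "))", RegexOptions.Singleline);
    Console.WriteLine(rg.Match(s1).Value);
    Console.WriteLine(rg.Match(s2).Value);
    Console.WriteLine(rg.Match(s3).Value);
    Console.WriteLine(rg.Match(s4).Value);
}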
// Locates every date in the document and records the position of the sentence that contains it.
public static List<LocAndValue<DateTime>> LocateDate(HTMLEngine.MyRootHtmlNode root)
{
    var list = new List<LocAndValue<DateTime>>();
    foreach (var paragraph in root.Children)
    {
        foreach (var sentence in paragraph.Children)
        {
            // Normalize the numerals and strip spaces before searching for dates.
            var orgString = DateUtility.ConvertUpperToLower(sentence.Content).Replace(" ", String.Empty);
            var dateList = DateUtility.GetDate(orgString);
            foreach (var strDate in dateList)
            {
                var dateNumberList = RegularTool.GetNumberList(strDate);
                // A full date needs year, month and day; skip partial matches
                // rather than indexing past the end of the list.
                // (Assumes GetNumberList returns a List<String>.)
                if (dateNumberList.Count < 3)
                {
                    continue;
                }
                int year;
                int month;
                int day;
                if (int.TryParse(dateNumberList[0], out year) &&
                    int.TryParse(dateNumberList[1], out month) &&
                    int.TryParse(dateNumberList[2], out day))
                {
                    list.Add(new LocAndValue<DateTime>()
                    {
                        Loc = sentence.PositionId,
                        Type = "日期",
                        Value = DateUtility.GetWorkDay(year, month, day)
                    });
                }
            }
        }
    }
    return list;
}
/// <summary>
/// Parses an HTML announcement file into a tree of paragraph and sentence nodes.
/// </summary>
/// <param name="htmlfile">Path of the HTML file to parse.</param>
/// <param name="TextFileName">Path of the matching text file; if it exists, it is used to adjust the parsed tree.</param>
/// <returns>The root node, or an empty root node when the file has no div[@type='pdf'] element.</returns>
public MyRootHtmlNode Anlayze(string htmlfile, string TextFileName)
{
    TableId = 0;
    DetailItemId = 0;
    TableList = new Dictionary<int, List<String>>();
    DetailItemList = new Dictionary<int, List<String>>();

    // The first element is normally a DIV, e.g. <div title="关于重大合同中标的公告" type="pdf">
    var doc = new HtmlDocument();
    doc.Load(htmlfile);
    var node = doc.DocumentNode.SelectNodes("//div[@type='pdf']");
    var root = new MyRootHtmlNode();
    if (node == null)
    {
        return root;
    }
    // Fall back to an empty title when the attribute is missing.
    root.Content = node[0].GetAttributeValue("title", String.Empty);

    // Every DIV on the second level is a paragraph.
    foreach (var secondLayerNode in node[0].ChildNodes)
    {
        // Skip #text nodes.
        if (secondLayerNode.Name == "div")
        {
            var title = secondLayerNode.Attributes.Contains("title")
                ? secondLayerNode.Attributes["title"].Value
                : secondLayerNode.InnerText;
            var secondNode = new MyHtmlNode();
            secondNode.Content = title;
            AnlayzeParagraph(secondLayerNode, secondNode);
            FindContentWithList(secondNode.Children);
            // Link each sentence to its next and previous sibling.
            for (int i = 0; i < secondNode.Children.Count - 1; i++)
            {
                secondNode.Children[i].NextBrother = secondNode.Children[i + 1];
            }
            for (int i = 1; i < secondNode.Children.Count; i++)
            {
                secondNode.Children[i].PreviewBrother = secondNode.Children[i - 1];
            }
            root.Children.Add(secondNode);
        }
    }

    // Correct special characters.
    foreach (var x1 in root.Children)
    {
        x1.Content = CorrectHTML(x1.Content);
        foreach (var x2 in x1.Children)
        {
            x2.Content = CorrectHTML(x2.Content);
        }
    }

    // Inspect the last paragraph: if its last sentence ends with a date,
    // split the sentence into the text before the date and the date itself.
    // The Count check avoids calling Last() on an empty list.
    if (root.Children.Count > 0)
    {
        var lastParagraph = root.Children.Last();
        if (lastParagraph.Children.Count > 0)
        {
            // Sample document: major contract 1232951
            var lastSentence = lastParagraph.Children.Last().Content;
            var sentence = DateUtility.ConvertUpperToLower(lastSentence);
            var dateList = DateUtility.GetDate(sentence);
            if (dateList.Count > 0)
            {
                var strDate = dateList.Last();
                if (!String.IsNullOrEmpty(strDate))
                {
                    var strBefore = Utility.GetStringBefore(sentence, strDate);
                    // The year is assumed to occupy the four characters before "年".
                    // Skip the split when that offset would be negative, which
                    // would make Substring throw.
                    var yearStart = lastSentence.LastIndexOf("年") - 4;
                    if (!String.IsNullOrEmpty(strBefore) && yearStart >= 0)
                    {
                        // Replace the last sentence with its two parts.
                        lastParagraph.Children.RemoveAt(lastParagraph.Children.Count - 1);
                        strBefore = lastSentence.Substring(0, yearStart);
                        lastParagraph.Children.Add(new MyHtmlNode() { Content = strBefore });
                        lastParagraph.Children.Add(new MyHtmlNode() { Content = strDate });
                    }
                }
            }
        }
    }

    // Adjust the tree according to the text file content.
    if (File.Exists(TextFileName))
    {
        // Only major-contract documents actually need this step.
        AdjustItemList(root, TextFileName);
        AdjustTwoLine(root, TextFileName);
    }

    // Link each paragraph to its next and previous sibling.
    for (int i = 0; i < root.Children.Count - 1; i++)
    {
        root.Children[i].NextBrother = root.Children[i + 1];
    }
    for (int i = 1; i < root.Children.Count; i++)
    {
        root.Children[i].PreviewBrother = root.Children[i - 1];
    }

    // Paragraph IDs start at 1; a sentence ID is paragraphId * 100 + its 1-based index.
    for (int i = 0; i < root.Children.Count; i++)
    {
        root.Children[i].PositionId = i + 1;
        for (int j = 0; j < root.Children[i].Children.Count; j++)
        {
            root.Children[i].Children[j].PositionId = (i + 1) * 100 + j + 1;
        }
    }

    root.TableList = TableList;
    root.DetailItemList = DetailItemList;
    return root;
}