케세라세라: 기계학습과 침입탐지: 문법 추론 결과

네트워크를 통해 전달되는 데이터는 최종적으로 인간이 식별 가능한 문자나 숫자 등의 기호로 표현되는 정보를 송수신하면서 상호간에 서로 의도했던 의미를 수립한다. 데이터 발생 주체가 인간인만큼 너무나 당연하게도 인간의 의사소통과 수단은 다를지언정 그 성격은 똑같다 - IDS와 보안관제의 완성 (29페이지)

'IDS와 보안관제의 완성' 집필 당시, 같은 문자열 패턴에 의해 공격과 정상이 갈리는 문제에 대한 근본적인 해결책을 고민하면서 '현대 기호학의 발전'이란 책을 읽어본 적이 있다.

의사소통에 사용되는 문자 체계의 기본 원리에 해답이 있지 않을까 싶어 시도한 독서. 그런데 기호학이란 분야가 너무 낯설더라(..)

황새 따라가다 가랑이 찢어지는 경험을 한 후, '송충이는 역시 솔잎' 마인드를 고수해왔는데, 최근 침입탐지 분야에 대한 기계학습 적용 가능성을 타진하는 논문을 접하게 됐다. 다들 어떻게든 기계학습 갖다 붙여보려는 상황에서 비관적인 결론이 인상적.

그런데 논문 저자가 컴퓨터 과학을 섭렵한 엄친아라 그런지, 온갖 외계 용어와 수식이 난무한다. 촘스키 나오고 오토마타 나오는 순간 뇌기능 정지(..) 컴공 전공했으면 좀 나았을래나? 다음은 유일하게 이해 가능했던 내용.

any practical method should therefore be restricted to a particular domain

도메인을 좁히지 않으면 (침입탐지 분야에 대한 기계학습 적용은) 어렵다

수천 개의 침입탐지룰 전반에 통용되는 보편적 규칙을 찾기란 매우 힘들기 때문일 것이다. ~~왜냐하면 그런 거 없음~~ 잘 분류하면 기계학습 범위를 좁힐 수도 있지 않을까?

애당초 룰을 잘 만들 생각을 하지 않고 대충 만든 룰을 기계학습으로 어찌 해보겠다는 생각, 깨진 독을 방지한 채 물을 부어주는 기계를 만들겠다는 생각 자체가 모순이라 생각한다.

AI와 규칙 기반 엔진 사이의 차이점은 규칙 기반 엔진이 누군가 규칙을 업데이트할 때까지 스스로 개선하는 것은 불가능하다는 점

해당 논문을 읽고(?) 어렴풋이 드는 느낌이라도 얘기해보자면, 우리가 원하는 (알아서 분석해주는) 인공지능을 만들려면 말장난을 이해하는 수준의 인공지능을 먼저 개발한 후, 프로토콜과 애플리케이션을 가르치는 게 더 빠를 거 같다는 정도?

인간은 사고(특히 상상력)와 언어를 통해 만물의 영장이 되었다. 그래서 개인적으로 컴퓨터의 자연어 처리가 완벽해지는 날, 컴퓨터가 '아 다르고 어 다른' 자연어를 자연(?)스럽게 구사하게 되는 날, 인간을 웃기고 싶어서 개그를 치게 되는 날이 진짜 인공지능이 구현되는 날이라고 생각한다.

구글 듀플렉스가 제대로 작동할 수 있었던 비결은 구글이 인정한 대로 널리 연구할 수 있는 닫힌 분야와 매우 제한된 종류의 데이터로 범위를 정했기 때문입니다. 미용실에 전화해 예약하는 대화가 정확히 거기에 해당하죠. 구글 듀플렉스는 그런 좁은 분야에 대한 상당한 훈련을 받은 다음에야 사람이 말하는 듯한 대화를 할 수 있습니다. 광범위한 주제를 아우르는 열린 대화는 아직 요원합니다.

참고로 저자 Richard Harang은 현재 sophos에서 malware detect 분야에 대한 기계학습을 연구하고 있단다. 네트워크 침입탐지란 복잡한 도메인 대신 malware란 도메인에만 집중하기로 한 모양.

그나저나 이 논문, 이해하고 싶다(..) 그래서 지나가는 능력자분의 쉬운 설명을 기대하며 전문을 올려본다. ~~저자나 미 국방부에서 태클 걸면 어쩌지?~~

Machine Learning and Network Intrusion Detection: Results from Grammatical Inference

Machine learning for network intrusion detection is an area of ongoing and active research, however nearly all results in this area are empirical in nature, and despite the significant amount of work that has been performed in this area, very few such systems have received nearly the widespread support or adoption that manually configured systems such as Bro or Snort have.
네트워크 침입탐지를 위한 기계학습은 지속적이고 활발한 연구 분야지만, 많은 노력에도 불구하고 발전은 지지부진, 현실은 Bro나 Snort처럼 사람이 일일이 셋팅해줘야 하는 시스템이 주력.

As discussed in [1], there are several differences between more conventional applications of machine learning and machine learning for network intrusion detection that make intrusion detection a challenging domain; these include the overwhelming class imbalance, the high asymmetry in misclassification costs, the difficulty in evaluating the performance of an intrusion detection system, and the constantly changing nature of network attacks.
네트워크 침입탐지를 위한 기계학습과 타 분야의 기계학습간에는 차이점이 있다. 클래스 불균형(정확한 데이터와 그렇지 않은 데이터, 정오탐 얘기인가?), 잘못된 분류에 의한 높은 비용(이것도 정오탐 얘기하는 거 같은데 -_-), 어려운 침입탐지 시스템의 성능 평가 및 끊임없이 변화하는 네트워크 공격.

However; these arguments, which largely stem from empirical observations about the distribution and impact of attacks, do not address a more fundamental question: whether or not it is possible from a theoretical perspective to use machine learning to construct a general and effective intrusion detection system.
공격의 분포와 영향에 대한 경험적 관찰에서 기인하는 이런 주장은 기계학습을 사용하여 효과적인 침입탐지 시스템을 구축할 수 있는지 여부를 다루지 않는다.

This is the sort of fundamental question that we seek to answer with a “fundamental science of cybersecurity” [5], which has so far proved to be somewhat elusive.
지금까지는 이런 의문에 대한 답은 애매했다. 우리는 "사이버 보안의 근본적인 과학"으로 답을 찾아보려 한다.

In this article, we collect results from the field of grammatical inference to show that the use of machine learning for intrusion detection via deep packet inspection is inherently limited, and cannot be expected to generalize effectively even if the proposed algorithm can be shown to perform well on some particular task.
이 문서는 문법 추론 결과를 통해 침입탐지를 위한 기계학습 사용이 본질적으로 제한되어 있으며, 제안된 알고리즘이 특정 작업에서 잘 수행된다 하더라도 효과적 일반화를 기대할 수 없음을 보여준다.

Network inputs, file formats, and other automatically generated input structures (henceforth: ‘messages’) are by construction formal grammars, in which the message is both produced and interpreted according to a fixed set of rules.
네트워크 입력, 파일 형식 및 기타 자동 생성 입력 구조는 고정된 규칙 집합에 따라 메시지를 생성하고 해석하는 공식 문법을 따른다.

If we wish to identify a particular class of message as either “safe” or “malicious”, we are then implicitly attempting to learn a recognizer for the (perhaps merely implicit) formal grammar underlying that class of message format or protocol.
우리는 특정 메시지를 "안전" 또는 "공격"으로 분류할 때, 메시지나 프로토콜을 구성하는 공식 문법을 배우려고 시도한다.

This places the problem firmly within the realm of grammatical inference, a field with a deep theoretical literature. And indeed, we can find strong parallels, if not outright reductions, from many commonly encountered problems in machine learning for intrusion detection and grammatical inference.
이것은 문법 추론 영역이며, 침입탐지 기계학습과 문법 추론 기계학습 사이에는 유사성이 있다.

In the following, we present brief background on grammatical inference, explore some common approaches to network intrusion detection through the lens of computational and show how these results illuminate several common approaches to machine learning for network intrusion detection.
다음은 문법적 추론에 대한 간단한 배경을 제시하고, 네트워크 침입탐지에 대한 일반적 접근법을 탐구하며, 이런 결과가 네트워크 침입탐지를 위한 기계학습의 몇 가지 공통적 접근법을 보여준다.

In particular, these results suggest that general purpose machine learning approaches to network intrusion detection are – from first principles – not computationally tractable, and in fact may be attempting to solve problems that range from cryptographically hard to NP-complete.
이런 결과는 네트워크 침입탐지에 대한 범용 기계 학습 접근이 어려우며 실제로 암호화부터 자연어 처리 등의 문제를 해결해야할 수도 있다.

Key Concepts from Formal Language Theory and Grammatical Inference

The key concepts from formal language theory that we will require are “languages”, “grammars”, and “recognizers”.
형식언어 이론의 핵심 개념인 "언어", "문법" 그리고 "인식자"가 필요하다.

Grammars are defined as tuples: G=(N,Σ,R,i) where N is a set of nonterminal symbols, Σ is a set of all possible symbols (terminal and nonterminal), R is a set of production rules indicating which symbols in the string can be replaced with other symbols (the ‘left’ and ‘right’ side of the rule, respectively), and i is an initial symbol. A “production” of a grammar is a sequence of rule applications to a sequence until it consists of only terminal symbols; the (typically infinite) set of all productions of a grammar is the language L produced by the grammar, often denoted L(G). A recognizer for a grammar is some algorithm which takes some input sequence as input, and either returns “accept” if s∈L(G), or “do not accept” if S∉L(G).
언어학이나 자연어 처리 분야의 이론 설명인 듯? 이해 포기 -_-

The various restrictions that are placed on the production rules place the grammar into one class or another.
생산 규칙의 다양한 제한들이 문법을 특정 분류 또는 다른 분류로 나눈다.

The most commonly encountered classification of grammars – the Chomsky hierarchy [6] – classifies grammars in increasing complexity as regular, context-free, context-sensitive, and unrestricted;
일반적으로 문법 복잡성 증가도에 따라 규칙적인, 문맥에 구애받지 않는, 문맥에 민감한, 무제한으로 분류한다.

each class being contained in the following one, having successively fewer restrictions on the production rules, and requiring a successively more powerful model of computation in order to recognize.
각 분류는 생산 규칙 제한이 더 적고, 더 강력한 연산 모델이 필요한 분류에 포함된다.

Regular grammars, for instance, may only have rules in which the left side contains a single nonterminal symbol, and the right side contains an empty string, a terminal symbol, or a terminal symbol followed by a nonterminal symbol.
정규 문법은 왼쪽에 무슨 기호가 오고, 오른쪽에는 무슨 기호가 오고 어쩌고 저쩌고(..)

The Chomsky hierarchy is of particular interest because each class within the Chomsky hierarchy corresponds precisely to a particular model of computation which is required to build a recognizer for a language in that class. Regular grammars, for instance, can be recognized by finite automata, while context-free grammars require the addition of a stack in order to be able to construct a recognizer for them.
아주 촘스키 나오고, 오토마타 나오고 난리 났네 -_-

Context-free grammars are of particular importance as the Backus-Naur notation, which is frequently used to describe network protocols and occasionally file formats are itself a notation for context-free grammars.
Context-free 문법은 Backus-Naur 표기법으로 특히 중요하며, 네트워크 프로토콜을 설명할 때도 자주 사용되고, 때로는 파일 형식 자체가 Context-free 문법의 표기법이다.

Grammars and recognizers then fix a set of production/recognition rules, and produce decisions about strings. Grammatical inference (see [7] for a broad review) tackles the inverse problem: given some set of data relating to an unknown grammar, can we produce decisions about properties of that grammar?
문법 및 인식기는 일련의 생산/인식 규칙을 수정하고 문자열을 결정하며, 문법적 추론은 역 문제를 해결한다. 알 수 없는 문법에 관련된 약간의 데이터가 주어진다면, 우리는 그 문법의 특성을 결정할 수 있을까?

Example data available for training may include only strings {s+:s∈Σ*∩L} in the language generated by a grammar – a “positive presentation” – or a set of strings s={s+:s∈Σ*∩L}∪{s-:s∈Σ* \L } combined with labels n : {0,1} which indicate if s∈s+ or s∈s– (a “complete presentation”).
교육에 사용할 수있는 예제 데이터에는 문법에 의해 생성된 언어의 문자열 또는 n : {0,1} 라벨과 결합된 문자열 집합만 포함될 수 있다.

The available data can in some cases also be obtained dynamically: the learning algorithm may have access to additional information, such as an oracle that when given a conjectured grammar G’ will either confirm its accuracy if G’=G, or provide a counterexample s∈L(G)\L(G’) otherwise (“learning from a teacher”).
사용 가능한 데이터는 경우에 따라 동적으로 얻을 수도 있다. 학습 알고리즘은 추측 문법 G'가 주어질 때 G'=G'가 아니면 정확성을 확인하거나, 또는 counterexample s∈L(G)\L(G’)와 같은 추가 정보에 액세스 할 수 있다.

One of the earliest formalizations for what it means to “learn” a grammar is “learning in the limit” [8], in which (informally) the learning algorithm requires only a finite amount of data to produce a correct answer from which it never then deviates.
문법을 "배우는" 것이 무엇을 의미하는지에 대한 최초의 공식 중 하나는 학습 알고리즘이 정답을 생성하기 위해 유한한 양의 데이터만 필요로 하는 "한계 학습"이다.

Note that this is not a particularly strong model of learning: the amount of data required can be arbitrarily large so long as it is finite, and it places no restriction on the size of the learned automata;
이것은 강력한 학습 모델이 아니다. 필요한 데이터양이 임의로 커질 수 있으며 학습된 오토마타의 크기에는 제한을 두지 않는다.

it need not be minimal or even bounded. More realistic models of learning include the “Probably Approximately Correct” (PAC) framework [9], rather than requiring L(G’) = L(G) always, we might require only that over some distribution of strings and some distribution of training data T, we require that PT(PD(s∉L (G’)∩L(G))<ϵ)≥1-δ;
최소거나 경계가 있을 필요는 없다. 보다 현실적인 학습 모델은 L (G ') = L (G)를 항상 요구하기보다는 "아마 대략적으로 정확한 것" (PAC) 프레임 워크를 포함한다. 우리는 문자열의 일부 분포와 훈련 데이터 T의 일부 분배에 대해서만 요구할 수 있다. 우리는 PT (PD (∉L (G ') ∩L (G)) <ε) ≥1-δ 이 필요하다.

in other words, the two grammars disagree on at most some bounded proportion of strings (i.e. our proposed grammar is “approximately correct”) and that this “probably” occurs, learning such an approximately correct grammar on at least 1-δ of all possible training sets.
두 문법은 거의 한정된 비율의 문자열에 대해 일치하지 않으며, 이는 가능한 모든 훈련 세트 중 최소한 1-δ에서 대략 정확한 문법을 배운다.

Regardless of the family, general results considering these various models are not encouraging. One of the earliest considered cases was examined by Gold [8]: that of learning in the limit from “presentations” as described above, where the learner is provided with data, cannot form its own queries, and must eventually converge to an equivalent grammar (it may only make finitely many mistakes).
이러한 다양한 모델을 고려한 일반적인 결과는 고무적이지 않다. 가장 초기에 고려된 사례 중 하나는 [8]에서 검토되었다. 위에서 설명한 "프리젠테이션" 한계에서 학습자가 데이터를 제공받고 자체 쿼리를 작성할 수 없으며 궁극적으로 동등하다.

In this area of learning by examples, Gold’s Theorem [8] demonstrates learning even regular languages is not possible from only positive examples (indeed, any class that contains all finite languages and at least one infinite language cannot be inferred from a positive presentation).
예제에 의한 학습 영역에서, Gold 's 정리는 긍정 예제만으로는 정규언어조차도 배울 수 없음을 보여준다.

Given a finite set of both positive and negative examples, the work of [10] shows that the decision question of whether there is an n-state DFA that agrees with the finite data set is NP-complete.
긍정과 부정 예제의 유한 집합을 감안할 때, [10]의 작업은 한정된 데이터 집합과 일치하는 n-state DFA가 존재하는지 여부에 대한 결정 문제가 NP-complete(컴퓨터로 해결 불가능한 문제?)임을 보여준다.

Through a reduction to Boolean formulas, the work of [11] shows that PAC learning of DFAs from both positive and negative examples – the standard “supervised learning” context perhaps more familiar to those with a background in machine learning – is at least as hard as factoring Blum integers, and so under current cryptographic assumptions does not appear to be polynomial-time tractable.
불린 수식의 감소를 통한 [11]의 연구는 긍정과 부정 예제에서 DFA의 PAC 학습이 적어도 Blum 정수를 인수분해하는 것만큼 어렵다는 것을 보여 주며, 따라서 현재의 암호화 가정은 다항 시간으로 다루기 어려워 보인다.

As DFAs occupy one of the lowest levels of the Chomsky hierarchy, it immediately follows that any grammar that requires more power than that offered by a DFA must be at least as hard as a DFA to learn.
DFA는 촘스키계층의 가장 낮은 레벨 중 하나를 차지하므로 DFA가 제공하는 것보다 더 많은 힘을 필요로 하는 문법은 적어도 DFA만큼 배우기 어렵다.

Active learning – in which the learning algorithm can form some kind of query to narrow its search – provides slightly more positive results. The closest counterpart to the notion of active learning encountered in (non-grammatical inference oriented) machine learning is the model of an “informant,” in which case the algorithm is allowed to construct query strings, and has access to an oracle which responds to membership queries for proposed strings. When learning with an informant, context-sensitive grammars may be learned in the limit [8].
능동적 학습은 조금 더 긍정적인 결과를 제공한다. 기계학습에서 발생하는 능동 학습에 가장 가까운 것은 "정보 제공자" 모델로, 알고리즘은 쿼리 문자열을 구성할 수 있으며 제안된 문자열에 대한 멤버 자격 쿼리에 응답하는 오라클에 액세스할 수 있다. 정보 제공자와 함께 학습할 때, 문맥에 민감한 문법을 제한적으로 학습할 수 있다.

The more powerful model of a “teacher” as described above (an informant who will provide counterexamples to proposed grammars) allows for yet more powerful inference. Under the model of a teacher, some context-free grammars may be learned in polynomial time with respect to the number of states in the underlying automata [12].
위에서 설명한 "교사"의 더 강력한 모델은 더 강력한 추론을 가능하게 한다. 교사 모델에 따라, 일부 context-free 문법은 기초적인 오토마타의 상태 수와 관련하여 다항 시간으로 배울 수 있다.

Finally, the use of “simple examples” as training data allows for learning minimal DFA in polynomial time [13], however simple examples – informally – are training items carefully chosen to guide the learning algorithm to the correct answer, and so in addition to being specific to the learning algorithm and the grammar, do not correspond to a more conventional learning approach where the distribution of the input data cannot be controlled.
마지막으로 학습 데이터로 "간단한 예제"를 사용하면 다항 시간에 최소 DFA를 학습할 수 있지만 간단한 예제는 학습 알고리즘을 정답으로 안내하기 위해 특화된 교육 항목이므로 입력 데이터의 분포를 제어할 수 없는, 일반적인 학습 접근법과 일치하지 않는다.

In addition to variations in available input to the algorithm, various modifications of the grammatical classes to which the unknown language may belong have also been considered. The class of pattern languages [14] can be shown to be learnable in the limit from positive data due to the property of “finite thickness” [15] – loosely, that there is no string that is found in infinitely many members of the class and therefore for any given subset of grammars in the class there must be some production which is unique to it – which is shown to be sufficient for identification in the limit.
알고리즘에 대한 이용 가능한 입력의 다양성에 더하여, 알려지지 않은 언어가 속할 수있는 문법 클래스의 다양한 변형도 고려되었다. 패턴 언어의 분류는 한계에서 식별하기에 충분하다고 보여지는 "유한 두께"의 특성으로 인해 긍정 데이터로부터 한도 내에서 학습 가능하다고 할 수 있다.

While pattern languages themselves are difficult to apply to many empirical problems in intrusion detection [16], the notion of finite thickness may in many cases be applicable. A related notion is that of elasticity [17]. A class of languages has “infinite elasticity” if and only if one can find an infinite sequence of productions such that for all n, the first n productions are found in some grammar in the class while production n+1 is not;
패턴 언어 자체는 많은 경험적 침입탐지 문제에 적용하기가 어렵지만 유한 두께 개념이 적용될 수 있다. 관련된 관념은 탄력성이다. 언어 부류는 모든 n에 대해 무한한 연속성을 찾을 수 있는 경우에만 무한 탄력성을 가지며, n+1생산이 아닌 부류의 일부 문법에서 첫번째 n생산이 발견된다.

if a class does not have infinite elasticity, then it is said to have finite elasticity and it can again be inferred in the limit from positive data. Finally, if a class of grammars has a family of “characteristic sets” (‘condition 1’ in [15]) such that each language in the grammar has an associated finite characteristic set, and showing that the characteristic set for language i exists within language j is sufficient to show that language i is a subset of language j, then languages from that class are learnable in the limit.
어떤 부류가 무한한 탄력성이 없을 때 유한한 탄력성을 가지며 긍정 데이터의 한계값으로 다시 추론할 수있다. 마지막으로, 문법의 각 언어가 관련된 유한 특성 집합을 가지며, j언어 내에 존재하는 i언어에 대한 특성 설정이 언어 i가 j의 하위 집합이며, 해당 클래스의 언어가 한계 내에서 학습 가능하다는 것을 보여준다.

Taken in whole, these results do not paint a particularly positive picture of grammatical inference. Even the simplest classes of grammars we might encounter have extremely poor “learnability,” and rendering them easier to learn often requires some degree of interactivity or side information.
전반적으로 문법 추론의 결과는 긍정적이지 않다. 우리가 접할 수있는 가장 단순한 문법 수업조차도 "학습 가능성"이 매우 낮으며, 학습을 더 쉽게 하려면 종종 어느 정도의 상호 작용이나 부가 정보가 필요하다.

Significant restrictions on the class of grammars which we may wish to perform inference on allows for substantial reductions in complexity, however we have little guarantee for novel protocols that we may rely on such restrictions. In the following section, we briefly examine major classes of machine learning that have been proposed for intrusion detection, and relate the results in this section to the practical problems posed by such learning systems.
추론을 수행하고자 하는 문법 분류를 제한하면 복잡성을 상당히 감소시킬 수 있지만, 우리는 그러한 제약에 의존할 수 있는 새로운 프로토콜에 대해서 거의 보장할 수 없다. 다음 섹션에서는 침입탐지를 위해 제안된 주요 기계학습 종류를 간략히 살펴보고 이 섹션의 결과를 이러한 학습 시스템이 제기한 실제 문제와 관련시킨다.

Machine learning and Intrusion detection

The literature on machine learning and intrusion detection is vast (see references in [1] for a partial overview; also, short reviews by [18] and [19] which contain more details about specific machine learning methods that have been attempted); however, it divides broadly into the two main categories of “anomaly detection” and “signature inspired” [20] (in machine learning terminology, “supervised learning”).
기계학습 및 침입탐지에 대한 분야는 광범위하지만 크게 "이상징후 탐지"와 "패턴매칭"으로 나뉜다.

Anomaly detection, including such systems as Anagram [21] and McPAD [22], focuses on constructing a “normal” model of traffic and producing alerts when traffic that does not fit this model is observed.
Anagram 및 McPAD와 같은 시스템을 포함한 이상징후 탐지는 트래픽의 "정상" 모델을 구성하고, 이 모델에 맞지 않는 트래픽이 발생할 때 경고를 생성하는 데 중점을 둔다.

Supervised learning systems (see [23], [24], and [25] for representative examples) are provided with both malicious and benign traffic, and attempt to learn rules to distinguish them. While less common in the domain of intrusion detection, active learning (i.e. interactive) approaches for outlier detection have been presented as well, as in [26].
지도학습 시스템에는 악성 트래픽과 양성 트래픽이 모두 제공되며 이를 구분하기 위한 규칙을 학습한다. 침입탐지 영역에서는 흔하지 않지만 [26]에서와 같이 특이치 검출을 위한 능동 학습 방법이 제시되었다.

In all cases, the general formulation of the problem is approximately the same. Network messages are – by construction – designed to be parsed and interpreted by a machine, and hence can be characterized as formal grammars which accept and reject on specific strings.
모든 문제의 일반적 공식은 거의 같다. 네트워크 메시지는 기계에 의해 파싱되고 해석되도록 설계되었으므로 특정 문자열을 수용하고 거부하는 공식 문법으로 특징지어질 수 있다.

Within the space of all possible strings M received by the network service, there is the subset A⊆M of messages that are accepted by the network endpoint; this subset itself contains S⊆A of “safe” messages that represent normal use of the endpoint. In learning a function that can identify which subset of a given message lies within – in particular for some string s whether or not s∈S – we are constructing a recognizer for S, thus placing the problem exactly within the domain of grammatical inference.
네트워크 서비스에 의해 수신된 모든 가능한 문자열 M의 공간 내에, 네트워크 종단점에 의해 수용되는 메시지의 하위 집합 A⊆M이 있다; 이 하위 집합 자체는 엔드 포인트의 정상적인 사용을 나타내는 "안전" 메시지의 S⊆A를 포함한다. 주어진 메시지의 하위 집합을 식별할 수 있는 함수를 학습할 때 우리는 S에 대한 인식기를 구성하여 정확하게 문법적 추리의 영역 내에 문제를 배치한다.

Consider, for example the case of a standard machine learning approach in which we are presented with samples of normal traffic from S, and hostile traffic from the set A\S, each appropriately labeled, and we wish to train a function to determine whether a candidate string lies in one set or the other. This is precisely the problem of learning a grammar from a complete presentation, and as such we may readily apply existing results.
S로부터의 정상적인 트래픽의 샘플과 적절한 A / S로부터의 적대적인 트래픽의 표본이 제시되는 표준 기계학습 접근을 고려해 보자. 후보 문자열은 하나의 집합 또는 다른 집합에 있다. 이것은 완전한 프리젠테이션에서 문법을 배우는 문제. 따라서 기존 결과를 쉽게 적용할 수 있다.

Even if we make the simplifying assumption that the protocol under consideration (or the union of protocols if deployed on a multi-protocol endpoint) is regular, we are still attempting to learn a Discrete Finite Automata from a complete presentation.
고려중인 프로토콜이 규칙적이라는 가정을 하더라도, 우리는 완전한 프리젠테이션에서 DFA를 배우려고 시도하고 있다.

If we wish to learn it precisely (in the limit) then we have that the problem is NP-complete [10]. If we wish to learn it in a more practical sense (i.e. PAC), then we have that the identification problem is “merely” cryptographically hard [11].
우리가 정확하게 그것을 배우고 싶다면 문제는 NP-complete이다. 좀 더 실용적인 의미로 배우고 싶다면, 식별 문제는 암호화 방식으로 어렵다는 것을 알게 된다.

This forces us to accept the conclusion that even if we obtain empirically good performance for a particular algorithm in a particular setting, we cannot be sure that it will generalize to a new domain.
특정 환경에서 특정 알고리즘에 대해 경험적으로 좋은 성능을 얻더라도 새로운 도메인으로 일반화할 수 있는지를 확신할 수 없다.

If we consider (as in McPAD [22]) that only positive (‘normal’) data S is available, and continue to assume that we are observing then we are attempting to learn a grammar from its positive presentation, with all associated complexities.
우리는 긍정적인 데이터 S만을 사용할 수 있다고 생각하고, 우리가 관찰하고 있다고 가정하며, 긍정적인 표현으로부터 모든 문법적 복잡성과 함께 문법을 배우려고 시도하고 있다.

While specific examples of grammars are clearly at least PAC-learnable in this setting, as shown by the results of [22], it follows immediately from the difficulty of learning from a positive presentation that McPAD must fail to generalize to at least some classes of grammars;
PAC- 학습이 가능한 것은 분명하지만 [22]의 결과에서 알 수 있듯이 McPAD가 적어도 몇 가지 클래스로 일반화하지 않으면 안된다

whether or not those grammars are of practical relevance to intrusion detection cannot be decided in any fashion other than empirically. We are thus left with no foundational guarantee of correctness; simply empirical observations.
이러한 문법이 침입탐지와 실질적으로 관련이 있는지 여부는 경험적인 방법 외에는 어떤 방식으로도 결정될 수 없다. 따라서 우리는 정확성에 대한 기본적인 보장은 못하고 단순히 경험적 관찰만 할 수 있다.

Clearly, when we consider a more realistic scenario in which several protocols may be present in the same set of network traffic, the problem often becomes significantly more difficult;
여러 프로토콜이 동일한 네트워크 트래픽에 존재하는 좀 더 현실적인 시나리오를 고려할 때 문제는 훨씬 더 어려워진다.

the problem of learning the mixture of grammars is at a minimum at least as hard as learning the most complex one, and depending upon the closure properties of that class, may in fact be more difficult. While most languages in the Chomsky hierarchy are in fact closed under unions, it is often not clear whether or not restricted classes (such as those that have finite thickness) may be.
문법의 혼합을 배우는 문제는 최소한 가장 복잡한 문법을 배우는 것만큼이나 어렵고, 그 문법의 폐쇄 속성에 따라 실제로는 더 어려울 수 있다. 촘스키 계층 구조에서 대부분의 언어들이 사실상 결합되어 있지만, 종종 제한된 분류가 있는지 없는지는 명확하지 않다.

While somewhat outside the realm of network intrusion detection, more powerful inference models such as learning from an informant have been shown to generate positive results in areas such as developing usable models of program execution (stateful typestates) [27].
네트워크 침입탐지가 아닌 분야에서는 정보 제공자로부터의 학습과 같은 보다 강력한 추론 모델이 긍정적인 결과를 낳았다.

This approach obtains a direct generative model of program outputs which can be examined for various security properties.
이 접근법은 다양한 보안 속성에 대해 검사할 수 있는 프로그램 출력의 직접 생성 모델을 가져온다.

Standard fuzzing techniques [28] are perhaps a more direct application within the security domain, in which a subset C⊆A of inputs that lead to crashes is learned from interaction with a program, frequently combined with additional information about the execution of code paths [29],
표준 퍼지 기법은 아마도 보안 영역내에서 보다 직접적인 애플리케이션으로서 충돌을 일으키는 입력의 하위 집합 C⊆A가 프로그램과의 상호 작용을 통해 학습되며, 종종 추가적인 코드 실행 정보와 결합된다.

however these methods do not typically produce formal descriptions or generative models of incorrect inputs, and rather seek to enumerate a useful subset of them, typically (in defensive settings) attempting to reduce the size of the set A.
그러나 일반적으로 잘못된 입력에 대한 형식 설명이나 생성 모델을 생성하지 않고 오히려 집합의 크기를 줄이려고 시도하는 하위 집합을 열거한다.

The work of [30] explores methods to leverage attacker knowledge in constructing fuzzing inputs via a descriptive language, which could be used in an iterative fashion to eventually describe a subset of the target grammar.
[30]의 작업은 기술 언어를 통해 퍼지 입력을 구성하는 데 있어 공격자의 지식을 활용하는 방법을 탐구한다. 이 언어는 대상 문법의 하위 집합을 설명하기 위해 반복적으로 사용할 수 있다.

In many cases, we do not even require the results of grammatical inference to show that a particular classifier cannot (provably) learn a sharp distinction between malicious and benign traffic.
많은 경우, 우리는 특정 분류기가 악의적인 트래픽과 양성 트래픽 간의 명확한 구분을 배울 수 없음을 보여주기 위해 문법적 추론의 결과를 요구하지도 않는다.

A key step in any machine learning process is that of ‘feature extraction’ in which the raw data that is to be classified is converted into some numerical representation that can then be operated on by the learning algorithm.
모든 기계학습 과정의 핵심 단계는 분류될 원시 데이터가 학습 알고리즘에 의해 작동될 수있는 수치 표현으로 변환되는 '피처 추출'이다.

N-grams (and minor variations on the concept such as skip-grams) are the core feature representation used in a number of anomaly-based intrusion detection systems, including Anagram [21] and McPAD [22], in which every n-byte substring within the payload is tabulated (for instance, the string “learning” would have 3-grams of “lea”, “ear”, “arn”, “rni”, and so on).
N-그램은 Anagram 및 McPAD를 포함하여 수많은 비정상 기반 침입탐지 시스템에서 사용되는 핵심 기능이다. 여기서 페이로드 내의 모든 n 바이트 부분 문자열이 도표화된다.

However such representations can be shown to be insufficiently powerful to distinguish between many members of the class of regular languages.
그러나 그러한 표현은 정규언어의 많은 구성원을 구별하기에 충분하지 않다.

For example, the rather trivial regular languages (ab)x(ba)y(ab)z and (ba)x(ba)y(ba)z cannot be distinguished from each other on the basis of 2-grams (note that the 2-grams bb and aa both appear exactly once in each, with variable numbers of ab and ba tokens), while constructing a recognizing DFA for each is trivial. Similar counterexamples can be constructed for n-grams of arbitrary length.
예를 들어, 정규언어 (ab) x (ba) y (ab) z와 (ba) x (ba) y (ba) z는 2-gram을 기준으로 서로 구별할 수 없다. 각각에 대해 DFA를 인식하는 것은 간단하다. 유사한 countererexamples는 임의 길이의 n-gram으로 구성할 수 있다.

This immediately implies that any learning algorithm that first reduces a sequence to n-gram counts is a priori incapable of learning large subsets of the class of regular grammars, and as a consequence we may expect that – even if empirically good performance is shown on some data sets – this performance cannot be relied upon to generalize.
이것은 곧 시퀀스를 n-gram 카운트로 감소시키는 학습 알고리즘은 정규 문법 분류의 많은 하위 집합을 학습할 수 없기 때문에 이 기능을 일반화할 수 없음을 의미한다.

Other feature representations are also used. Perhaps most widely known are those of the (now severely out-of-date but still regularly studied) KDD’99 data set [31], which parses the network traffic to extract a number of features suggested by expert knowledge to be of use.
다른 특징 표현도 사용된다. 아마도 가장 널리 알려진 것은 KDD'99 데이터 세트로, 네트워크 트래픽을 구문 분석한 후, 전문 지식에 의해 제안된 여러 가지 기능을 추출하여 사용할 수 있다.

In addition to “metadata” describing flow-based properties such as the duration of the connection and the number of concurrent connections to the same host, a number of content-based features are extracted from both the payload of the packets (e.g., the presence of a shell prompt, extracting FTP commands to obtain a tally of outbound ones, etc.) and headers of the packets (identifying the network protocol, various ports and addresses, protocol-specific flags, and so on).
연결 기간 및 동일한 호스트에 대한 동시 연결 수와 같은 흐름 기반 속성을 설명하는 "메타 데이터" 외에도 패킷 페이로드와 헤더에서 많은 콘텐츠 기반 기능이 추출된다.

These features obviously make no attempt to model any significant portion of the content of the packets, and so make the prospect of inferring a grammar from them infeasible; at best, some of the manually extracted features act as “telltales” for specific attacks, and thus allow what is effectively signature-based detection.
이러한 특징들은 패킷 내용의 중요한 부분을 모델링하지 않으므로 문법 추론 가능성이 없다. 기껏해야 수동으로 추출된 기능 중 일부를 이용해서 특정 공격에 대한 서명 기반 탐지를 할 수 있다.

And indeed, the most effective current approach in intrusion detection remains (anecdotally at least) signature-based solutions [1] such as Snort [3].
실제로 침입탐지에서 현재 가장 효과적인 방법은 Snort와 같은 서명 기반 솔루션이다.

The effectiveness of such solutions can be explained precisely within the context of grammatical inference, as a well-written content-based signature is equivalent to a production that is not (or is very rarely) a production of the grammar underlying “good” traffic, and hence form a telltale for the set A\S.
그러한 솔루션의 효과는 문법 추론의 맥락에서 정확하게 설명될 수 있다. 잘 작성된 내용 기반 서명은 "양호한" 트래픽의 기초가 되는 문법의 생산이 아니며 따라서 집합 A\S에 대한 안내문을 형성한다.

And indeed, Snort and Bro [2] both contain sophisticated pattern-matching rules that are capable of recognizing a wide range of malicious traffic, in effect acting as small recognizers for subsets of the malicious grammars under consideration.
실제로 Snort와 Bro는 모두 악의적인 트래픽을 광범위하게 인식할 수 있는 정교한 패턴매칭 규칙을 포함하고 있으며, 실제로 악의적인 문법의 하위 집합에 대한 작은 인식자의 역할을 한다.

It is worth noting, however, that the rule generation in this case is often done by hand, and even when done in an automated fashion is typically attempting to match a finite subset of malicious traffic for a specific attack, and then tested on a set of larger normal traffic to assess false positives; this is equivalent to the bounded version of the problem posed in [10], which can take place in polynomial time.
그러나 규칙은 수동으로 생성되며, 자동으로 생성하는 경우에도 일반적으로 특정 공격에 대한 악의적인 트래픽의 한정된 하위 집합과 일치시킨 후, 더 큰 집합에 대해 테스트한다. 오탐(false positive)을 평가하기 위한 더 큰 정상적인 트래픽은 다항 시간에 발생할 수 있는 [10]에서 제기된 문제의 한정된 버전과 동일하다.

A key distinction here is that – rather than attempting to model all ‘safe’ or ‘malicious’ productions – any method that produces some form of signature is attempting to model a finite number of productions of a single protocol under heavily supervised conditions, and so does not address the question of novel attacks that machine learning-based solutions are often attempting to address [1].
여기에서 중요한 차이점은 어떤 형태의 서명을 생성하는 모든 방법이 엄격한 감독 조건 하에서 단일 프로토콜의 한정된 수를 모델링하므로 기계학습 기반 솔루션이 자주하는 새로운 공격에 대한 문제를 다루지 않는다.

Conclusion

The high performance of machine learning in other domains has stimulated significant interest in applying it to network security, however (as noted in [1]), despite the breakneck pace of major successes with machine learning in many other domains, and the large amount of effort spent to produce machine learning-based intrusion detection systems, in practice most major network defense providers focus continue to use signature-based methods which have been in active use since the late 1990’s.
많은 영역에서 기계학습이 성공했고, 네트워크 보안 분야에서도 적용하는 데 많은 노력을 기울였다. 하지만 현실적으로 침입탐지 시스템의 경우, 대부분의 주요 네트워크 방어 업체는 1990년대 후반부터 사용되어온 서명 기반 방법(패턴매칭)을 계속 사용하고 있다.

Drawing on the extensive literature on grammatical analysis, we propose that this is a reflection of a fundamental difference between more conventional domains of machine learning and network security.
문법 분석에 대한 광범위한 문헌은 기계학습이 적용되는 네트워크 보안과 전통적인 영역 사이에 근본적인 차이가 있음을 알려준다.

In particular, because network security – particularly network security applications that focus on analysis of packet contents – operates on the domain of formal grammars that are rigorously interpreted (as compared to the domain of natural language translation, where human intuition can often “fill in the gaps” in translation), it is an intrinsically difficult problem that a) is demonstrably intractable in the most general case, and b) cannot be addressed with the relatively crude features that appear to be most common in the literature.
특히 네트워크 보안은 엄격한 해석이 필요한 공식 문법의 영역에서 작동하기 때문에 문헌에서 가장 보편적으로 나타나는 비교적 조잡한 특징으로 해결할 수 없다.

While some modest success has been recently realized in applying sequence-to-sequence models (thus at least partially avoiding the question of feature spaces) for grammatical inference in specific instances of specific protocols [32], there remains no method to demonstrate that such methods will generalize even to different instances of the same protocol, let alone novel protocols in the same class.
특정 프로토콜의 특정 사례에서 문법적 추론을 위한 시퀀스 - 시퀀스 모델을 적용할 때 약간의 성공이 있었지만, 그 방법이 동일한 프로토콜의 다른 인스턴스에도 일반화될 수 있음을 입증할 수 있는 방법은 없다.

In fact, results from grammatical inference show that there is quite likely no general method that can be applied to arbitrary data to separate benign and malicious traffic; any practical method should therefore be restricted to a particular domain, analyze that domain carefully, and at least attempt to investigate what properties of the protocol under analysis may allow it to be effectively learned.
문법 추론 결과는 양성 및 악성 트래픽을 분리하기 위해 임의의 데이터에 적용할 수 있는 일반적인 방법이 거의 없다는 것을 보여준다. 따라서 실제적인 방법은 특정 도메인에만 국한되어야 하고, 해당 도메인을 신중하게 분석해야 하며, 분석 중인 프로토콜의 속성이 효과적으로 학습되도록 허용할 수 있는지 조사해야 한다.

The empirical effectiveness of Snort and Bro signatures suggest that the domain of malicious traffic is likely more tractable, and may be easier to learn. The appearance of particular byte sequences in malicious but not benign traffic can be viewed (informally) as evidence that the class of malicious languages is of finite elasticity (due to the absence of a limit language) within the class of all protocols that can produce accepting inputs to the system under consideration, thus supporting identifiability.
Snort 및 Bro 패턴매칭의 경험적 효과는 악의적인 트래픽의 영역이 다루기 쉽고 배우기 쉽다는 것을 의미한다. 악의적인 트래픽에서 특정 바이트 시퀀스의 출현은 악의적 언어 클래스가 고려중인 시스템에 대한 입력을 받아들일 수 있는 모든 프로토콜 클래스 내에서 유한한 탄력성을 가지므로 식별 가능성을 지원한다는 증거로 볼 수 있다.

Feature representation is also important. N-gram based features in particular will quite often be insufficiently powerful to model complex grammars or protocols; in some cases, sufficiently large values of n may be able to overcome this limitation for specific subclasses of protocols, however this is likely to be highly problem specific, and requires careful evaluation for any given proposed system.
피처 표현도 중요하다. 특히 N 그램 기반 기능은 복잡한 문법이나 프로토콜을 모델링하는 데 충분하지 못할 수 있다. 어떤 경우에는 n의 충분히 큰 값이 프로토콜의 특정 하위 클래스에 대해 이 제한을 극복할 수 있지만 매우 문제가 많을 것으로 예상되며 제안된 시스템에 대해 신중한 평가가 필요하다.

While significant open questions remain – such as methods for performing inference on the restricted classes of grammars that in practical terms make up many existing protocols – the immediate results of applying grammatical inference theory to machine learning for intrusion detection both help explain the lack of widespread adoption of such systems, and suggest appropriate avenues for future work.
문법 추론 이론을 적용한 결과 침입탐지 시스템은 기계학습을 적용하기 어렵다?

관련 글

Easy to analyze if you are really curious about data

페이지

2018년 6월 17일 일요일

기계학습과 침입탐지: 문법 추론 결과

댓글 없음:

댓글 쓰기

크리에이티브 커먼즈 라이선스